Abstract
This study examined the alignment and predictive power of instructional practices as reported by teachers, students, and external raters, using the Shanghai data from the Global Teaching Insights study, which included 85 teachers and 2,613 students. Results from exploratory factor analysis, correlational analysis, and ordinary least squares regression indicate that the same four conceptual components were identified in the instructional practices reported by teachers and by their students: classroom discourse (e.g., allowing students to explain their ideas and engage in peer discussions), meaning-making (e.g., explaining why a mathematical procedure works), cognitive activation (e.g., encouraging students' critical thinking in solving complex tasks), and clarity instruction (e.g., teachers giving clear explanations of the subject matter). The cognitive activation factor in the teacher-reported data emerged as the most significant predictor of students' post-test scores, whereas the classroom discourse factor in the student-reported data accounted for the largest portion of variance in students' post-test scores. Furthermore, our analysis revealed that the alignment between ratings reported by students and external raters was the highest, and student ratings of their mathematics teachers' instructional practices demonstrated the highest predictive power for students' post-test scores. These results provide important empirical evidence for the merit of cognitive activation and classroom discourse in mathematics teaching and should prompt researchers, practitioners, and policy-makers to pay careful attention to student-reported instructional practices, which can serve as a better source of data for measuring mathematics teaching quality.
Introduction
In recent decades, teaching quality has been recognized as a crucial driving force in shaping student achievement, and identifying the elements of effective instructional practices has thus become an important task for researchers, especially in the field of mathematics education (Barber & Mourshed, 2007; Bokhove, 2022; Cheng, 2014; Stigler & Hiebert, 2009). Teaching quality, which refers to the overall effectiveness of teaching as often measured by improvement in student academic achievement (Hill et al., 2012; Little et al., 2009), is a multifaceted theoretical construct that encompasses various aspects of instructional practices, including the specific strategies, techniques, and methods that teachers utilize to facilitate effective teaching and learning experiences in the classroom (Bell et al., 2020; Praetorius et al., 2017, 2018; Van der Lans, 2018). Previous studies have commonly relied on teachers' self-reported instructional practices, student surveys, in-classroom observations, or lesson recordings to evaluate the quality of mathematics teachers' instructional practices. Although useful insight into what constitutes teaching quality, either within a specific education system or internationally, has emerged, the inconsistent level of agreement among the different types of data remains a critical issue, leading to confusion or uncertainty regarding the merit of each type of data used in evaluating teaching quality (e.g., Begrich et al., 2020; Desimone et al., 2010; Fauth et al., 2014, 2020; Kaufman et al., 2016; Wagner et al., 2016; Wisniewski et al., 2020).
The discrepancy in the findings of the existing literature regarding the alignment of different types of data warrants further exploration of this issue. More importantly, it suggests that multiple sources of evaluation need to be considered when assessing instructional quality, as each source provides valuable but potentially different perspectives on instructional practices. Additionally, achieving higher correlations among the measures provided by teachers' self-reports, student ratings, and observer ratings would provide strong evidence of construct validity for teaching quality that can be used to inform classroom instruction and teaching evaluation (Kunter & Baumert, 2006; Wagner et al., 2016). However, to date, few studies have compared all three types of data within one study setting. Therefore, the objective of this study was to address this gap by examining the alignment among the three types of data and their respective predictive power for student mathematics achievement, which serves as a critical indicator of the quality of mathematics teaching (Hill et al., 2012; Little et al., 2009).
Literature review
Instructional practices and student mathematics achievement
Although student mathematics achievement can be influenced by various contextual factors at the school, classroom, and student levels, research has highlighted the pivotal role of teaching quality in contributing to student learning (Hill et al., 2012; Little et al., 2009; Stigler & Hiebert, 2009). Research on teaching quality has largely followed the teacher effectiveness tradition, which focuses on two important components: the process (teachers' actual teaching in the classroom) and the product (the student learning outcomes that teachers eventually foster) (Brophy, 2000; Fenstermacher & Richardson, 2005). Therefore, teachers' instructional practices in the classroom and student learning outcomes are the two essential elements in assessing teaching quality. Accordingly, selecting valid and reliable approaches to measuring teaching quality and identifying the effective instructional practices that constitute quality teaching becomes significant.
The common approach to evaluating teaching quality has been to utilize ratings obtained from external observers, while questionnaire data reported by teachers and students have also been used (Goe et al., 2008; Little et al., 2009). Although other student learning outcomes such as engagement and motivation have been used in measuring teaching quality, standardized student achievement test scores have more often been selected because they afford comparisons across different studies (Goe et al., 2008; Grossman et al., 2014; Rieser et al., 2013), and our study adopts this approach. In the following, we review the merits of the three types of rating data, the alignment among them, and their predictive power for student mathematics achievement on standardized tests.
Teachers’ perceived and self-reported instructional practices
Self-reported data have been widely used in a large number of studies, including those in the field of mathematics education, because of their relatively low cost, ease of administration, and capacity to involve a large number of teachers and thereby produce externally generalizable findings (Newfield, 1980; Schneider et al., 2007). Nevertheless, debate continues regarding the accuracy of teachers' self-reports of their classroom instructional practices. Proponents argue that surveys about teachers' instructional practices are reliable and can be used (e.g., Mayer, 1999; McCaffrey et al., 2001; Porter et al., 1993; Ross et al., 2003). In particular, high reliability and accuracy can be achieved when certain conditions are met: when surveys are conducted anonymously (Aquilino, 1994, 1998), when respondents are asked to provide demographic information, self-efficacy, and interest (Chan, 2009; Stone et al., 2000), or when they report on unconnected happenings that occurred in recent and unique scenarios (Bradburn, 2000; Tourangeau et al., 2000). Additionally, high reliability and accuracy can be obtained when teachers are asked to describe their teaching behaviors instead of judging the quality of their teaching (Mullens & Gayler, 1999), and when researchers use composite variables instead of a single variable from teachers' self-reported data in subsequent analyses (Mayer, 1999). Teachers' perceived and self-reported instructional practices appear to align with what is obtained from external observations and interviews (Kaufman et al., 2016; Mayer, 1999; Ross et al., 2003), especially when teachers are prompted to report on the instructional practices they use in a single class or within a limited time period (Newfield, 1980; Porter et al., 1993).
However, using teachers' self-reports to measure instructional quality can be challenging, as some researchers have found self-reported data to have low reliability and accuracy when respondents are asked to report on their own experiences, attitudes, attributes, or behaviors that are either socially desired or frowned upon (Bradburn, 2000; Devaux & Sassi, 2016; Little et al., 2009; Tourangeau et al., 2000; Van de Vijver & He, 2014). Other researchers have found teachers' self-reported data to be unreliable when compared side by side with in-classroom observations in portraying teachers' instructional practices (Bodzin & Beerer, 2003; Brophy & Good, 1986; Koziol & Burns, 1986), specifically when this type of data is used to infer the quality of instructional practices (Kaufman et al., 2016; Mayer, 1999). This can be due to various factors, such as teachers' desire to present themselves in a positive light, fear of negative consequences for admitting weaknesses or shortcomings, or a belief that their responses will be used to evaluate their performance or effectiveness (Reddy et al., 2015). Because of these factors, teachers may not always provide accurate or complete information about their instructional practices or the quality of their teaching. As a result, self-reports may not always provide a reliable or valid measure of instructional quality.
Despite the debate over the strengths and weaknesses of self-reported data, teachers' self-reports of their instructional practices are still commonly used in educational studies. Previous studies using teachers' self-reported data have provided valuable information about the potential factors influencing teaching and learning practices, though such data are limited in their ability to inform researchers about the actual teaching and learning that take place in the classroom so that such practices can be improved (Grossman et al., 2014; Hill et al., 2011; Pianta & Hamre, 2009). Alternative approaches such as on-site observations or ratings based upon lesson recordings tend to be time-consuming and costly, and they also suffer from threats that may reduce the overall reliability and accuracy of the data collected (Praetorius et al., 2012; Reddy et al., 2019). Another popular approach is to involve the students, who are themselves part of the teaching and learning process, in reporting on instructional practices. In the next section, we review relevant studies that examined the trustworthiness of this type of data.
Student-rated instructional practices
Similar to teachers' perceived and self-reported data, students' reports of instructional practices have both merits and limitations. In terms of merits, previous research has provided compelling evidence supporting the reliability and validity of student ratings of teaching quality (Fauth et al., 2014; Praetorius et al., 2012; Wagner et al., 2016). In recent years, researchers have placed greater emphasis on investigating the stability of student ratings across different school years, classes, or subjects, as well as on critical reflections on the merit of student ratings. For instance, Gaertner and Brunner (2018) found high stability in student-rated instructional practices, unaffected by factors such as the time elapsed between ratings, the subject taught, or the grade level. Similarly, Fauth et al. (2020) found relatively stable student ratings of teaching quality when students from the same class rated the same teacher in two consecutive school years and when the same students rated different teachers. However, Fauth and colleagues (2020) also noted rather low stability in ratings across different classes taught by the same teachers. Reflecting on the value of student ratings, Lauermann and ten Hagen (2021) argued that students, as authentic participants in the teaching and learning process and natural recipients of teachers' knowledge transmission, occupy a unique position in the classroom from which to observe their teacher's instructional practices on a daily basis. This proximity allows them to critically reflect on and evaluate their teachers' teaching performance, as their primary purpose is not mere observation but active engagement in learning. The reliability of student-rated instructional practices is further improved because such ratings are aggregated across many lessons in which the students spent a significant amount of time (Gaertner & Brunner, 2018; Lauermann & ten Hagen, 2021). Therefore, instructional practices reported by students have good merit, and this type of data has been collected in teaching evaluations and educational studies.
However, there are limitations that can lead to bias in student-rated instructional practices. Some researchers have expressed concerns about the reliability of this type of data, citing students' young age (Fauth et al., 2019) and their lack of training as raters: unlike professional raters, who are equipped through a rigorous training process with the skills and expertise to use an evaluation instrument proficiently, young students may not comprehend the terms or construct dimensions that the instrument contains (Fauth et al., 2020). Accordingly, disagreement might arise between the data obtained from students who experience their teachers' teaching and those obtained from professionally trained raters, even though previous research has provided some evidence for the reliability and validity of ratings from both students and external observers (Fauth et al., 2014, 2020; Begrich et al., 2020). Nevertheless, a recent study (Tsai et al., 2022) found that student-reported teaching effectiveness aligns with the theory-based structure of the survey completed by students, which helps address the concerns expressed by Fauth et al. (2019, 2020) and thus supports the claim that young students at the elementary and secondary levels have the capacity to discern various dimensions of teaching. In the next section, we review relevant studies that examined the trustworthiness of another commonly used approach to assessing instructional quality, namely observer ratings.
Observer-rated instructional practices
There is ongoing debate about the extent to which observer ratings can be used to judge teachers' instructional quality. Some researchers argue that observer ratings should be used as one piece of evidence in a broader evaluation process (e.g., Lei et al., 2018; Van der Lans, 2018), while others question the validity and fairness of such ratings (e.g., Praetorius et al., 2014; Weston et al., 2021). On the one hand, some prior studies have found that observer ratings of instructional quality are consistent across different observers and occasions and that they correlate with other measures of instructional quality, such as student ratings and student learning outcomes (Hill et al., 2012; Van der Lans, 2018). Additionally, Reddy et al. (2019) found that when observers use well-validated standardized observation tools and rubrics to assess instructional quality, the resulting ratings are generally reliable and valid. This type of rating provides a comparatively objective measure of teaching effectiveness that is not based solely on student ratings or teacher self-reports, which can be useful in providing a more complete picture of instructional quality. Additionally, observer-rated instructional quality based upon standardized instruments or rubrics allows for comparison across different classrooms and teachers, which can be useful in identifying areas of strength and weakness, making decisions about resource allocation, and supporting accountability for teaching effectiveness (Lei et al., 2018).
However, as some researchers have argued, the weaknesses of this type of rating are also obvious. First, observer ratings are inherently subjective, as observers may bring their own biases, beliefs, and interpretations of teaching behaviors to the rating process, causing potential variability in ratings and negatively impacting their reliability and validity (Praetorius et al., 2014). Because of this, observers need to be rigorously trained on the observation tool or rubric and then calibrated to ensure that they apply the tool or rubric consistently from beginning to end (Briggs & Alzen, 2019). Additionally, observers are typically able to observe only a limited portion of a teacher's practice and may not have a complete understanding of the context in which teaching occurs; such limited scope may also negatively impact the validity and reliability of the ratings (Weston et al., 2021). Furthermore, observer ratings can be resource-intensive, requiring trained observers, observation time, and rating instrument development and validation, which can severely limit the feasibility of using observer ratings in large-scale evaluation systems (Hill et al., 2012; Reddy et al., 2019).
Despite these limitations, observer-rated instructional quality has remained a valuable tool. Its validity and reliability rely largely on those of the observation tool or rubric (Reddy et al., 2019), on whether observers are rigorously trained, and on whether rating calibration is considered (Briggs & Alzen, 2019). Since data collected from observer ratings can provide researchers with rich information about the actual teaching and learning process (Grossman et al., 2014; Hill et al., 2011; Pianta & Hamre, 2009), particularly when video recordings of teachers' lessons are obtained and then analyzed by trained raters, this approach has gained popularity in recent decades with the affordances of modern video technology, as evidenced by the Video Study of the Trends in International Mathematics and Science Study (TIMSS) in the 1990s (Stigler & Hiebert, 2009) and the Organisation for Economic Co-operation and Development (OECD)'s large-scale Global Teaching Insights (GTI) study in 2018 (OECD, 2020a). Having reviewed the merits and limitations of the three types of data collected from teachers, students, and observers, we now examine the correlations among them and their predictive power in relation to student achievement.
Alignment of three types of ratings and their predictive power
Studies on the alignment between ratings reported by teachers, students, and external observers have provided some good evidence of their associations, but the results are quite divergent. First, while there is no consistent pattern across studies, the majority of research suggests that teachers tend to rate their own teaching practices higher than external observers do (e.g., Debnam et al., 2015; Hansen et al., 2014; Kaufman et al., 2016), as teachers have a vested interest in presenting their teaching in the best possible light, while external observers are more objective and have no personal investment in the outcome. Additionally, external observers may have a broader perspective on teaching and may focus on different aspects of instructional quality than teachers themselves.
Regarding the alignment between ratings reported by teachers and their students, discrepancies also exist, and only low to moderate correlations were found in studies conducted over the past two decades (Desimone et al., 2010; Kunter & Baumert, 2006; Wagner et al., 2016; Wisniewski et al., 2020). Similarly, regarding the alignment between ratings by students and external observers, some earlier studies found that correlations between these two types of ratings typically fall within the 0–0.50 range (Fauth et al., 2014; Begrich et al., 2020). However, Fauth et al. (2020) found that between-class teaching quality as rated by students and external observers varied considerably, from 0.39 to 0.85 across the various dimensions of teaching quality, with student motivation as a significant predictor of such variations. As argued by some researchers (Kunter & Baumert, 2006; Wagner et al., 2016), attaining higher correlations among the three types of ratings from teachers, students, and observers would provide robust evidence for the construct validity of teaching quality, which could subsequently inform classroom instruction and teaching evaluation. However, the low to moderate correlations observed among the three types of ratings, coupled with the discrepancies in results identified in previous studies, suggest that more research is needed to examine the causes of such discrepancies.
Regarding the predictive power of the different types of ratings for student mathematics achievement, some studies have found that student-reported instructional quality exhibits greater predictive power for noncognitive learning outcomes, such as engagement (Lauermann & Berger, 2021) or motivation (Schiefele & Schaffner, 2015). The noticeable limitations of these studies include their predominant use of only one or two types of data in analyzing teachers' instructional practices and their weak focus on student cognitive learning outcomes. These limitations are significant considering the low level of agreement among the ratings of instructional practices provided by teachers, students, and external observers (Desimone et al., 2010; Fauth et al., 2020; Kunter & Baumert, 2006). Such limitations, along with the discrepancies related to the alignment of the three types of ratings in the existing studies, necessitate the inclusion of different perspectives in order to better analyze teachers' instructional practices. This multifaceted assessment approach enables researchers to triangulate the data and potentially obtain a more thorough and accurate depiction of teachers' instructional practices, which play a critical role in shaping student learning outcomes.
The current study
In this study, we aim to address the gaps and limitations in the existing literature by utilizing the Shanghai sample from the large-scale GTI study, a video study of teaching administered by the OECD. The GTI study collected valuable data on instructional practices as self-reported by mathematics teachers and as observed by students and trained raters (Opfer, 2020). These data provide a rich picture of teachers' instructional practices that can be analyzed to explore the alignment of the three types of ratings and their relationship with a critical indicator of teaching quality, namely students' mathematics achievement (Goe et al., 2008; Hill et al., 2012; Little et al., 2009). More importantly, the GTI study purposefully included the same sets of questions about mathematics teachers' instructional practices in the student and teacher questionnaires, which enabled us to examine whether the conceptual components outlined in the quality teaching framework of the GTI study, described below, can be identified in teachers' self-reported practices and in those reported by their students (Praetorius et al., 2020a, 2020b).
The quality teaching framework of the GTI study
To measure mathematics teachers' teaching quality across different countries and economies, the GTI study team first conceptualized what constitutes teaching quality and then developed its own instrument to achieve this goal. The GTI study's conceptualization of teaching quality drew on various sources, including the specific conceptualizations of quality teaching and teaching standards from each participating country and economy, a critical review of the global observation literature on teaching quality and teaching effectiveness, and the pertinent conceptual frameworks on teaching quality from the Teaching and Learning International Survey (TALIS) 2018 and the Program for International Student Assessment (PISA), both administered by the OECD, to ensure close alignment (Klieme, 2020). The resulting six-domain quality teaching framework consists of classroom management, social-emotional support, discourse, quality of subject matter, students' cognitive engagement, and assessment of and responses to students' understanding (Bell et al., 2018, 2020).
Since the framework was constructed based on theoretical considerations rather than empirical data (Castellano & Bell, 2020) and a more simplified domain structure might exist, the GTI study team combined the latter four domains into the general realm of "instruction," a latent construct specifically focusing on mathematics instructional practices, and then tested it with confirmatory factor analysis. The team found that the model containing the general realm of "Instruction" had good fit statistics (robust CFI = 0.96, robust TLI = 0.92, RMSEA = 0.10), which allowed it to serve as an overarching latent construct encompassing the four sub-domains, that is, discourse, quality of subject matter, cognitive engagement, and assessment of and responses to student understanding (Castellano & Bell, 2020). Therefore, this latent construct was used as the coding framework by trained raters to rate mathematics teachers' instructional practices, and the resulting data were adopted in our analysis, together with the teacher and student rating data, to answer research questions 2 and 3 listed below. Additionally, the framework of the latent construct "Instruction" also guided our analysis for the first research question. We discuss these points in more detail in the next section.
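To make this model-testing step concrete, the following is a minimal sketch of a single-factor CFA of this kind in Python, using the open-source semopy package on synthetic stand-in data. The indicator names here are hypothetical placeholders for the four domain scores, not the actual GTI variables, and the GTI team's own estimation procedure is the one documented in Castellano and Bell (2020).

```python
import numpy as np
import pandas as pd
from semopy import Model, calc_stats

# Synthetic stand-in data: one shared signal plus noise for four
# hypothetical domain scores (not the actual GTI variables).
rng = np.random.default_rng(42)
signal = rng.normal(size=200)
df = pd.DataFrame({name: signal + rng.normal(size=200)
                   for name in ["discourse", "subject_matter",
                                "cognitive_engagement", "assessment"]})

# One overarching latent "Instruction" construct with four indicators.
desc = "Instruction =~ discourse + subject_matter + cognitive_engagement + assessment"
model = Model(desc)
model.fit(df)

# Report the same kinds of fit indices discussed above (CFI, TLI, RMSEA).
print(calc_stats(model).T.loc[["CFI", "TLI", "RMSEA"]])
```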
Specifically, the current study seeks to answer the following three research questions:
RQ1: What are the different conceptual components of the instructional practices reported by teachers and students?
RQ2: To what extent do instructional practices reported by teachers, students, and external raters align with each other?
RQ3: How do the instructional practices reported by teachers, students, and external raters compare in predicting students’ mathematics achievement?
Method
Data source
To answer our research questions, we used the Shanghai dataset from the large international GTI study, which was conducted between 2016 and 2020 collaboratively among a total of eight OECD member countries and partner economies with the aim of measuring mathematics teachers' instructional practices and their relation to students' cognitive and non-cognitive learning outcomes (Opfer et al., 2020). To identify any possible shared or divergent patterns of mathematics teaching across the eight countries and economies, the GTI study purposefully selected the topic of quadratic equations, which is typically taught at the secondary school level in all participating countries and economies. In Shanghai, this important topic is taught at the eighth-grade level. A total of 2,613 students around the age of 14, along with their 85 teachers across 85 schools in Shanghai, participated in the GTI study.
Instruments
Student and teacher prequestionnaire and postquestionnaire
The GTI study administered both a student prequestionnaire and a student postquestionnaire to collect student background and learning-related information. The prequestionnaire was administered prior to the focal unit of instruction on quadratic equations, while the postquestionnaire was administered at the conclusion of the unit. Additionally, the GTI study administered a teacher prequestionnaire and postquestionnaire at the same times to collect teachers' background and focal-unit-related information such as lesson goals and content covered. Students rated their teachers' instructional practices in the student postquestionnaire, while teachers' self-reported instructional practices were collected in the teacher postquestionnaire; both were rated on a scale of 1–4, with 1 indicating "never or almost never" and 4 indicating "always" in using each specific instructional practice (see Tables 1 and 2).
Table 1. Survey items in the teacher postquestionnaire asking teachers to self-report their instructional practices during the focal unit on quadratic equations.
Table 2. Survey items in the student postquestionnaire asking students to rate their teachers' instructional practices during the focal unit on quadratic equations.
The aforementioned theoretical construct "Instruction" formulated by the GTI study team consists of four domains and served as the blueprint for designing the coding instrument, which trained raters then used to rate each mathematics teacher's instructional practices. In the following, we explain what each of the four domains entails and how the coding instrument was used to rate the teachers' instruction.
The discourse domain refers to the conversational interactions between and among teachers and students that promote active engagement, critical thinking, and the exchange of ideas. Effective discourse provides students with opportunities to articulate their ideas, respond to the ideas of others, and engage in productive conversations that deepen their understanding of the subject matter (Praetorius et al., 2020a, 2020b). This domain was assessed with three components rated on a 4-point scale: the nature of discourse (whether students actively contribute to the discourse), questioning (whether the questions offer a good mix of levels of cognitive involvement and help engage students in critical analysis, synthesis, or justification), and explanations (whether such explanations emphasize extended and profound mathematical content) (see Table 3).
Table 3. Coding domains and components for the latent construct Instruction used by trained raters.
*More sample questions are available in Global Teaching InSights: Observation Tools (OECD, 2020b).
Quality of subject matter refers to the clarity and accuracy of the content and tasks, as well as the ability of students and teachers to make explicit connections between the subject matter, procedures, viewpoints, and clear and appropriate representations or equations (Bell et al., 2020). This domain consists of two components: explicit connections, and explicit patterns and generalizations. Explicit connections refer to connections among subject matter ideas, procedures, or equations, between the content being learned in mathematics class and real-world contexts, and across different subject areas. Explicit patterns and generalizations refer to the extent to which teachers and students actively look for patterns in mathematical problems and use those patterns to generalize and draw conclusions (see Table 3).
The student cognitive engagement domain refers to students' active and thoughtful interaction with the subject matter, encompassing a range of activities that require them to analyze, create, and evaluate information (Praetorius et al., 2020a, 2020b). Through these activities, teachers can create a rich learning environment that promotes students' cognitive engagement and supports their development of knowledge and skills. This domain was assessed with three components rated on a 4-point scale: engagement in cognitively demanding subject matter (whether teachers engage students in analytic, creative, or evaluative tasks that are cognitively challenging and demand thoughtfulness), multiple approaches to and perspectives on reasoning (whether students employ more than two techniques or lines of reasoning to thoroughly solve a problem), and understanding of subject matter procedures and processes (whether students regularly focus on the justification of procedures and processes when encountering them) (see Table 3).
Assessment of and response to student understanding involves teachers aligning their instruction with their students' thinking to provoke conceptual attainment, ensure better evaluation, and provide feedback in order to foster deeper learning (Praetorius et al., 2020a, 2020b). By eliciting students' thinking, providing feedback, and aligning instruction with students' thinking, teachers can create a learning environment that supports the development of students' knowledge and skills and promotes deeper learning (Praetorius et al., 2020a, 2020b). This domain was assessed with three components rated on a 3-point scale: eliciting student thinking (whether teachers' questions, academic prompts, and meaningful tasks bring about a variety of student contributions), teacher feedback (whether there are common loops of feedback and thorough teacher-student discussions), and aligning instruction to present student thinking (whether teachers regularly provide scaffolding to help students attain conceptual understanding when they get stuck) (see Table 3).
The rigorously trained and certified raters of the GTI study coded the recorded lessons of each mathematics teacher according to the component rating rubric for each domain. Component scores were rated on scales whose lowest point (1) indicates the lowest level or least frequent use and whose highest point indicates the highest level or most frequent use of a given component during the teaching of quadratic equations (Bell et al., 2020). The detailed rubrics for each component code can be accessed in the chapter annex of the GTI Technical Report (International Project Consortium, 2020).
The GTI study administered a pretest and a post-test to the participating students within two weeks before their teachers began the instruction on quadratic equations and within two weeks immediately following the completion of that instruction (Praetorius et al., 2020a, 2020b). Since the 30-item pretest measured students' mastery of general mathematics knowledge while the 25-item post-test assessed only students' mastery of knowledge and skills related to quadratic equations, this study used student post-test scores as the measure of student mathematics achievement to answer the third research question. Both the pretest and post-test scores were first standardized according to the mean (231.11) and standard deviation (13.93) of the Shanghai sample and then rescaled on the basis of an IRT model to range from 100 to 300, with a mean of 200 and a standard deviation of 25 across the entire international sample, in order to make the achievement data comparable across the participating countries and economies (Doan & Mihaly, 2020). There were 4 missing values in the teacher dataset (TQB08A-D) and 59 missing values in the Shanghai student dataset (SQB08 and SQB09). Missing values in TQB08A-D, SQB08, and SQB09 were replaced with 0 using the zero-imputation method, and 37 cases with missing student post-test scores were deleted from the dataset, following the analytical recommendations of the GTI study team (Doan & Mihaly, 2020).
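As a concrete illustration of this data preparation, the following is a minimal Python sketch of the imputation, deletion, and rescaling steps just described. The DataFrame and column names (student_df, teacher_df, posttest) are hypothetical stand-ins for the GTI variables, and only the linear standardization and rescaling is shown; the actual GTI scores were derived from an IRT model (Doan & Mihaly, 2020).

```python
import pandas as pd

def prepare_gti_scores(student_df: pd.DataFrame,
                       teacher_df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    # Zero imputation for missing survey responses, per the GTI team's
    # analytical recommendations (column names are hypothetical stand-ins).
    teacher_items = [f"TQB08{c}" for c in "ABCD"]
    teacher_df[teacher_items] = teacher_df[teacher_items].fillna(0)
    student_items = ["SQB08", "SQB09"]
    student_df[student_items] = student_df[student_items].fillna(0)

    # Listwise deletion of students with a missing post-test score.
    student_df = student_df.dropna(subset=["posttest"])

    # Standardize against the Shanghai sample mean and SD, then rescale to
    # the international metric (mean = 200, SD = 25).
    z = (student_df["posttest"] - 231.11) / 13.93
    student_df = student_df.assign(posttest_intl=200 + 25 * z)
    return student_df, teacher_df
```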
Data analytic approach
To answer the first research question, exploratory factor analysis (EFA) using the principal axis factoring (PAF) extraction method and a promax (oblique) rotation with Kaiser normalization was performed on the teacher self-reported teaching practices and on the observed practices as reported by students. EFA is typically used to identify the underlying factors that may explain the correlations among a set of observed variables (Finch, 2019); therefore, it is the appropriate analytic approach for the first research question, which seeks to identify the conceptual components of the instructional practices reported by teachers and by their students. The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy and Bartlett's test of sphericity were used to determine the suitability of the data for factor analysis (Finch, 2019).
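To illustrate this procedure, the sketch below runs an EFA with an oblique promax rotation in Python using the factor_analyzer package on synthetic stand-in data; in the study, the inputs would be the 15 teacher- or student-reported survey items. Note that factor_analyzer's "principal" method is used here as an approximation of principal axis factoring.

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Synthetic stand-in data for the 15 survey items (hypothetical names).
rng = np.random.default_rng(7)
items = pd.DataFrame(rng.normal(size=(300, 15)),
                     columns=[f"item{i + 1:02d}" for i in range(15)])

# Factorability checks: Bartlett's test of sphericity and the KMO measure.
chi2, p = calculate_bartlett_sphericity(items)
kmo_per_item, kmo_total = calculate_kmo(items)
print(f"Bartlett chi2 = {chi2:.1f} (p = {p:.3f}); overall KMO = {kmo_total:.2f}")

# Principal-factor extraction with an oblique promax rotation.
efa = FactorAnalyzer(n_factors=4, method="principal", rotation="promax")
efa.fit(items)

# Display only salient loadings (|loading| >= 0.30), as in the study.
loadings = pd.DataFrame(efa.loadings_, index=items.columns)
print(loadings.where(loadings.abs() >= 0.30).round(2))
print("Eigenvalues:", efa.get_eigenvalues()[0].round(2))
```

On real survey data, the eigenvalues and the variance explained after rotation would drive the decision of how many factors to retain, as reported in the results below.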
To answer the second research question, we used correlation analysis to identify the relationships among the EFA factors extracted in the previous step and the Instruction factor based on the external observers' ratings. Student-level data were aggregated to the class level before the correlational analysis was run. Ordinary least squares regression was then used to answer the third research question regarding the predictive power of the instructional practices reported by teachers, students, and external raters in explaining the variance in students' post-test scores obtained after the unit of instruction on quadratic equations.
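The sketch below illustrates these steps under the same hypothetical naming assumptions as above: student ratings and post-test scores are averaged within each class, joined with the teacher- and rater-level measures, correlated, and then entered into separate OLS models, one per rating source.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in data; teacher_id, the factor scores, and posttest are
# hypothetical placeholders for the GTI variables.
rng = np.random.default_rng(1)
n_students, n_classes = 2613, 85
student_df = pd.DataFrame({
    "teacher_id": rng.integers(0, n_classes, size=n_students),
    "SCA": rng.normal(size=n_students),          # student cognitive activation
    "SCD": rng.normal(size=n_students),          # student classroom discourse
    "posttest": rng.normal(200, 25, size=n_students),
})
teacher_df = pd.DataFrame({
    "teacher_id": np.arange(n_classes),
    "TCA": rng.normal(size=n_classes),           # teacher cognitive activation
    "RS": rng.normal(size=n_classes),            # external rater score
}).set_index("teacher_id")

# Aggregate student-level ratings and scores to the class level.
class_level = student_df.groupby("teacher_id").mean().join(teacher_df)

# RQ2: correlations among the teacher-, student-, and rater-based measures.
print(class_level.corr().round(3))

# RQ3: separate OLS models per perspective (perspectives are not mixed
# in a single equation, following the study's analytic approach).
for preds in (["TCA"], ["SCA", "SCD"], ["RS"]):
    fit = sm.OLS(class_level["posttest"], sm.add_constant(class_level[preds])).fit()
    print(preds, "R^2 =", round(fit.rsquared, 3))
```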
Results
RQ1: Conceptual components of instructional practices reported by teachers and students
Results from the EFA show that the correlation matrices of the instructional practices reported by teachers and by students are factorable. Some correlation coefficients are greater than 0.30; the KMO measure of sampling adequacy is 0.80 for the teacher-level data and 0.94 for the student-level data; and Bartlett's test of sphericity is significant at the 0.001 level for both. The values in the anti-image correlation matrix are small, with 3% of the values larger than 0.20 for the teacher-level variables and all values smaller than 0.08 for the student-level variables, while the measures of sampling adequacy (MSA) for individual variables range from 0.63 to 0.87 at the teacher level and from 0.91 to 0.96 at the student level, all larger than 0.50, indicating adequate sampling adequacy for factor analysis (Cerny & Kaiser, 1977; Tabachnick & Fidell, 2018).
Principal axis factoring with promax rotation and Kaiser normalization of the 15 teacher-reported survey items (TQB08A-K and TQB09A-D) yielded four factors with eigenvalues greater than 1.0, explaining 49.72% of the total variance after rotation. Table 4 shows the factor loadings of each item on each of the four extracted factors. Items with factor loadings greater than or equal to 0.30 were considered to load significantly on a given factor. The results indicated that Factor 1 (classroom discourse) was primarily defined by items TQB08H, TQB08I, TQB08J, and TQB08K; Factor 2 (meaning making) by items TQB09A, TQB09B, TQB09C, and TQB09D; Factor 3 (clarity instruction) by TQB08A, TQB08B, TQB08C, TQB08D, and TQB08F; and Factor 4 (cognitive activation) by TQB08B, TQB08E, TQB08F, TQB08G, and TQB08H. The negative loading of TQB08B on Factor 4 might mean that this item describes the opposite of that factor.
Table 4. Items and factor loadings from the exploratory factor analysis with promax rotation for teachers' self-reported data.
Principal axis factoring with promax rotation and Kaiser normalization of the 15 student-reported survey items (SQB08A-K and SQB09A-D) yielded three factors with eigenvalues greater than 1.0. Although the eigenvalue of the classroom discourse factor was 0.83, it explained over 5% of the variance before rotation; therefore, we retained this factor in a four-factor structure that explained 57.95% of the total variance after rotation. Table 5 shows the factor loadings of each item on each of the four extracted factors. The results indicated that clarity instruction was primarily defined by items SQB08A, SQB08B, SQB08C, SQB08D, and SQB08F; meaning making by items SQB09A, SQB09B, SQB09C, and SQB09D; cognitive activation by items SQB08E, SQB08F, SQB08G, and SQB08H; and the fourth factor, classroom discourse, by items SQB08I, SQB08J, and SQB08K. Both SQB08F and SQB09D had loadings exceeding 0.30 on two factors.
Table 5. Items and factor loadings from the exploratory factor analysis with promax rotation for student data.
RQ2: Alignment among instructional practices reported by teachers, students, and external raters
Results from the correlation analysis are presented in Table 6. The correlation coefficients between each pair of the four teacher self-reported factors range from 0.305 to 0.493, and those between the four student-reported factors range from 0.769 to 0.889, p < .01. The correlations between the four corresponding pairs of factors at the student and teacher levels were all positive but quite weak, and among the four pairs, a significant correlation was found only for the pair of clarity instruction factors, r = 0.259, p < .05. Additionally, all the student-reported teaching practice factors had moderate, significant, positive correlations with the rater-reported teaching practice factor (rRS−SCA = 0.453, p < .01; rRS−SCI = 0.429, p < .01; rRS−SCD = 0.541, p < .01; rRS−SMM = 0.429, p < .01). Lastly, three of the four teacher self-reported factors had non-significant but positive, weak correlations with the rater-reported teaching practice factor, ranging from .043 to .106, and one factor, clarity instruction, had a very weak, non-significant, negative correlation with the rater-reported factor.
Table 6. Correlations among instructional practices scores reported by teachers, students, and external raters.
Note. TCA = teacher cognitive activation; TCI = teacher clarity instruction; TMM = teacher meaning-making; TCD = teacher classroom discourse; SCA = student cognitive activation; SCI = student clarity instruction; SMM = student meaning-making; SCD = student classroom discourse; RS = rater score.
*Correlation is significant at the 0.05 level (2-tailed). **Correlation is significant at the 0.01 level (2-tailed).
RQ3: Predictive power of the three types of ratings
Multiple regression analyses were conducted to answer the third research question. Given that the evaluation of instructional practices involves variables from three different perspectives, it is not theoretically reasonable to include all of the observed variables (such as teacher cognitive activation [TCA], student cognitive activation [SCA], and rater score [RS]) in the same multiple regression equation. As a result, this study conducted separate multiple regression analyses for each perspective, and the results are presented in Table 7. Furthermore, because of the high correlations (rs > 0.70) among the four factors from the student-reported instructional practices, this study adopted a stepwise method in the multiple regression analysis of the student-reported factors. As a result, only one significant factor, student classroom discourse (SCD), was identified and is thus reported in Table 7.
Table 7. Multiple regression analyses predicting student post-test scores.
Note. TCA = teacher cognitive activation; TCI = teacher clarity instruction; TMM = teacher meaning making; TCD = teacher classroom discourse; SCD = student classroom discourse; RS = rater score.
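To make the stepwise procedure and the collinearity diagnostics reported in this section concrete, here is a minimal Python sketch using statsmodels on synthetic stand-in data. It implements a simple forward-selection variant (entry by smallest p-value, with no removal step), which approximates but does not exactly reproduce SPSS-style stepwise regression; the factor names are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    """Variance inflation factor for each predictor (constant included in fit)."""
    Xc = sm.add_constant(X)
    return pd.Series({col: variance_inflation_factor(Xc.values, i)
                      for i, col in enumerate(Xc.columns)}).drop("const")

def forward_stepwise(y: pd.Series, X: pd.DataFrame, alpha: float = 0.05):
    """Greedily add the predictor with the smallest p-value until none
    clears the entry threshold; return the fitted final model."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().pvalues[c]
                 for c in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return sm.OLS(y, sm.add_constant(X[selected])).fit()

# Synthetic stand-in data for four highly correlated student-reported factors.
rng = np.random.default_rng(3)
base = rng.normal(size=85)
X = pd.DataFrame({name: base + rng.normal(scale=0.5, size=85)
                  for name in ["SCA", "SCI", "SMM", "SCD"]})
y = X["SCD"] + rng.normal(size=85)

print(vif_table(X).round(2))          # collinearity check
print(forward_stepwise(y, X).params)  # typically retains a single factor
```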
Teacher perspective
Results from the correlational analysis (see Table 6) indicated that two of the four teacher self-reported factors did not have significant correlations with student post-test scores; the exceptions were TCA, rPS−TCA = 0.289, p < .01, and teacher clarity instruction (TCI), rPS−TCI = 0.232, p < .05. The multiple regression analysis examining the relationship between teacher-reported instructional practices and student achievement included four predictor variables: TCA, TCI, teacher meaning making (TMM), and teacher classroom discourse (TCD). The analysis revealed that the model was significant, F(1, 83) = 7.562, p = .007, partial η² = .092. Specifically, TCA was a significant predictor of the outcome variable (β = .574, p = .007), while TCI, TCD, and TMM did not significantly predict student post-test scores (β = .326, p = .128; β = −.003, p = .990; β = .086, p = .746, respectively). The model yielded an R² of .084, indicating that the predictors together explained 8.4% of the variance in the outcome variable. The variance inflation factors (all VIFs < 1.5) indicated that multicollinearity was not a concern in the analysis.
Student perspective
Results from the correlation analysis (see Table 6) indicated that all the student-reported teaching practice factors had moderate, significant, positive correlations with student post-test scores (rPS−SCA = .290, p < .01; rPS−SCI = .290, p < .01; rPS−SCD = .315, p < .01; rPS−SMM = .264, p < .05). As in the teacher-level model, four predictor variables were included in the multiple regression analysis examining the relationship between student-reported instructional practices and student achievement: SCD, student meaning making (SMM), student clarity instruction (SCI), and SCA. Results showed that the overall model was significant, F(1, 2574) = 9.171, p = .003, partial η² = .110. SCD was found to be the only significant predictor of student post-test scores, β = 1.075, t(2574) = 3.028, p = .003, and it explained 9.9% of the variance in student post-test scores (R² = .099). The excluded-variable analysis showed that SMM, SCI, and SCA were not significant predictors of student post-test scores.
Rater perspective
Results from the correlational analysis (see Table 6) indicated that the rater score was positively and significantly correlated with students' post-test scores, r = .429, p < .01. The regression model included one independent variable, the rater-reported instructional practice score (RS), with student post-test scores as the dependent variable. The analysis yielded an R² of 0.047, indicating that rater-reported instructional practices related to the domain of instruction accounted for 4.7% of the variance in post-test scores. The ANOVA results indicated that the regression model was significant, F(1, 83) = 4.083, p < .05, partial η² = 0.049. The coefficient for RS was significant, B = 1.195, t(83) = 2.021, p < .05, indicating that a one-unit increase in the rater-reported instructional practice score was associated with a 1.195-unit increase in student post-test scores.
Discussion
Using the Shanghai sample of the GTI study, the current study first examined the underlying conceptual components of the instructional practices reported by teachers and their students to answer RQ1. Results from the EFA revealed that essentially the same four-factor structure was identified from both sources of survey items, though several survey items loaded differently on three of the four factors. Specifically, the meaning making factor was defined by the same items in both sources, and the items that define the other three factors, that is, classroom discourse, clarity instruction, and cognitive activation, mostly overlap. Such results can serve as important empirical evidence for the trustworthiness of the two types of data, and we deem these findings significant considering the ongoing debate on the reliability and validity of data perceived by teachers (Mayer, 1999; McCaffrey et al., 2001; Mullens & Gayler, 1999; Porter et al., 1993; Ross et al., 2003) or reported by their students (Fauth et al., 2014; Praetorius et al., 2012; Wagner et al., 2016) for evaluating instructional quality. In particular, these results resonate with Tsai et al.'s (2022) finding that student-reported teaching effectiveness aligns with the theory-based structure of the survey completed by students, and they therefore further allay the concerns of researchers (Fauth et al., 2019, 2020) who considered students' young age and lack of training on the survey instrument as potential causes of lowered validity and reliability in student ratings of instructional practices. Nonetheless, there are still differences both in the items that define three of the four factors and in the order and weight of the factors contributing to the total variance of the instructional practices reported by teachers and students. These differences, along with the mostly weak and non-significant correlations between each pair of factors, might result from the inherent nature of self-reported data (Debnam et al., 2015; Hansen et al., 2014; Kaufman et al., 2016), students' young age, or students' lack of training before completing the questionnaire (Fauth et al., 2019, 2020).
For RQ2, regarding the alignment among the three types of ratings, our study provides new evidence that all the student-reported teaching practice factors correlate significantly and positively with the rater-reported teaching practice factor, in the 0.429 to 0.541 range, which falls within the 0.39–0.85 range that Fauth et al. (2020) found in their study but partially differs from the 0–0.50 range found by other researchers (Fauth et al., 2014; Begrich et al., 2020). The current study also found weak but positive correlations between the four pairs of factors obtained from the instructional practices reported by teachers and their students, and among the four pairs, a significant correlation was found only between the pair of clarity instruction factors, which confirms the low to moderate relationship between teacher- and student-reported instructional practice ratings found by previous researchers over the past decades (Desimone et al., 2010; Kunter & Baumert, 2006; Wagner et al., 2016; Wisniewski et al., 2020). With regard to the alignment between ratings reported by teachers and external observers, our study found that three of the four teacher self-reported factors had non-significant but positive, weak correlations with the rater-reported factor, ranging from 0.043 to 0.106; interestingly, one factor, clarity instruction, had a very weak, negative correlation with the rater-reported factor. Our results resonate with previous studies that reported a discrepant, inconsistent pattern between these two sources of evaluation (Debnam et al., 2015; Hansen et al., 2014; Kaufman et al., 2016), possibly due to the common concerns about teachers' self-reported data (e.g., Bradburn, 2000; Devaux & Sassi, 2016; Kaufman et al., 2016; Little et al., 2009; Tourangeau et al., 2000; Van de Vijver & He, 2014). Considering that the GTI external raters were rigorously trained in their use of a well-validated observation instrument and that quality control was ensured in the rating process (Bell, 2020a, 2020b), the results reported by the raters are deemed highly reliable and valid (Reddy et al., 2019). The non-significant, weak, and even negative correlations between these two sources should encourage researchers to further examine the possible underlying causes.
Additionally, when comparing the predictive power of the three types of ratings to answer RQ3, our study found that only one of the four teacher-reported factors, cognitive activation, was a significant predictor, explaining 8.4% of the variance in the outcome variable; similarly, only one of the four student-reported factors, classroom discourse, significantly predicted student post-test scores, explaining 9.9% of the variance. The instructional practice factor obtained from external raters was also a significant predictor, but it explained only 4.7% of the variance in the outcome variable, the lowest of the three. These results indicate that student-reported instructional quality had the greatest predictive power for student mathematics achievement, which further supports the reliability and validity of this type of data. The evidence also echoes the findings of some previous studies, although non-cognitive learning outcomes were often used in those studies (e.g., Lauermann & Berger, 2021; Schiefele & Schaffner, 2015; Wagner et al., 2016). As argued by Lauermann and ten Hagen (2021), students are the authentic participants in, and direct recipients of, teachers' instruction, which gives them firsthand experience of the teaching methods and strategies used in the classroom. As participants in the classroom, especially in inquiry-based mathematics classrooms characterized by active learning and a collaborative environment, students are immersed in inquiry-based activities and are encouraged to explore concepts, ask questions, and construct their own understanding (NCTM, 2000, 2014). Studies have shown that when students engage in inquiry-based collaborative activities, they are more likely to develop a deeper understanding of mathematical concepts, improve their problem-solving skills, and enhance their critical thinking abilities (Elbers, 2003; Goos, 2004; Staples, 2007), along with showing better engagement, motivation, and active participation, all of which lead to improved academic performance and long-term retention of mathematical concepts (Blazar, 2015; Carpenter et al., 1996). Students' active involvement in the learning process may allow them to provide valuable insights and feedback on the effectiveness of teaching methods, the clarity of explanations, and overall instructional quality, especially considering that this type of rating is aggregated across multiple lessons, which allows for a more comprehensive and representative view of the teacher's instructional practices (Gaertner & Brunner, 2018; Lauermann & ten Hagen, 2021). Consequently, students' perceptions and feedback offer valuable insights into the effectiveness of instructional practices. By considering their perspectives, researchers and educators may gain a better understanding of the factors that contribute to successful learning in mathematics.
Lastly, it is interesting that two different factors from the teacher and student data, that is, cognitive activation and classroom discourse, were found to be significant in predicting student mathematics learning outcomes, since both factors point to the active learning category. This suggests that teaching practices oriented toward fostering higher-order critical thinking skills that activate deep cognitive processes, and toward encouraging meaningful, relevant classroom discussions in mathematics classrooms, are important in helping students achieve better learning results. These valuable quality instructional practices are endorsed by researchers and policy makers in both the East Asian and US contexts (Cai & Ding, 2017; Carpenter et al., 1996; Ding et al., 2022; Fennema et al., 1996; Leung, 2001; Leung & Li, 2010) and should be promoted in all mathematics classrooms.
Limitations, implications, and future directions
Before discussing the implications of the results, we would like to note two limitations of this study. First, the results are based on the Shanghai data from the GTI study, which included 85 teachers and 2,613 students. Shanghai is a developed urban city in China, and generalization of the results to other contexts should be made with caution. Second, the GTI study was specifically designed to explore classroom instructional practices related to one essential topic in mathematics, quadratic equations. Hence, the results should be interpreted accordingly, and the implications discussed below are bounded thereby.
Results from our study indicate that the alignment between ratings reported by students and external raters is the highest among the three types of ratings, and that student-reported instructional practices have more predictive power than the other two types for student mathematics achievement on the post-test taken after the quadratic equations unit, which suggests that student-reported mathematics teaching practices are a better source of data for evaluating mathematics instructional quality. In the future, program designers might give more weight to student-reported instructional practices and provide students with training before they use the instrument, to help them discern various dimensions of teaching and thereby improve the validity and reliability of such data (Fauth et al., 2019, 2020; Tsai et al., 2022). Additionally, more studies could be conducted to evaluate whether the same results hold at other grade levels and in other subject areas, such as science or language arts education.
Furthermore, the finding that classroom discourse in student-reported instructional practices emerged as the most significant factor contributing to student achievement suggests that, in mathematics teaching, teachers should prioritize the development of a supportive learning environment and positive classroom norms to foster strong teacher-student and student-student relationships. When students feel safe and have a positive rapport with their teachers and peers, they are more likely to actively participate in class discussions, ask questions, and share their thoughts and ideas about the concepts and skills they are learning (Nathan & Knuth, 2003). This engagement can enhance their understanding and ultimately contribute to improved mathematics achievement. Classroom discourse has gained much support from mathematics education reform initiatives in China (Ministry of Education of the People's Republic of China, 2011, 2022) as well as in the West, especially in the United States (Council of Chief State School Officers, 2010; National Council of Teachers of Mathematics, 2000, 2014), and has also been endorsed by numerous studies (e.g., Nathan & Knuth, 2003; Zhao et al., 2016). More effort should be devoted to analyzing the nuances of classroom discourse, with the aim of enhancing its effectiveness in facilitating the optimal learning and teaching of school mathematics.
Lastly, the design of the GTI study is unique in that it incorporates three different types of measures, from teachers, students, and external raters, to evaluate instructional practices (Opfer, 2020). Given that only teachers and students responded to the same types of survey questions in the GTI study, efforts should be made to better align the perspectives of teachers, students, and external observers on what constitutes effective instructional practices. New waves of the GTI study and other large-scale studies might therefore consider designing a single instrument that can be adapted for use by teachers, students, and external raters in rating classroom instructional practices, which would better facilitate comparison of the validity and reliability of the three types of measures and ultimately help clarify the inconsistencies and discrepancies found in existing studies.
Conclusion
Using the Shanghai data from the GTI study, this study first compared the conceptual components of the instructional practices reported by teachers and their students. Subsequently, it examined the level of alignment among the instructional practices reported by the three sources. Lastly, it investigated the predictive power of these three types of ratings for student mathematics achievement. The same four conceptual components were identified in the instructional practices reported by teachers and their students, with the cognitive activation factor in the teacher-reported data and the classroom discourse factor in the student-reported data being, respectively, the most significant predictors of student post-test scores. Furthermore, when comparing the three types of ratings, we found the alignment between ratings reported by students and external raters to be the highest. Notably, student ratings of their mathematics teachers' instructional practices displayed the highest predictive power for students' post-test scores. These findings provide crucial empirical evidence highlighting the importance of cognitive activation and classroom discourse in mathematics instruction. We urge researchers, practitioners, and policy makers to pay close attention to the instructional practices reported by students, as they can serve as a valuable and reliable source of data for evaluating the quality of mathematics instruction.
Acknowledgements
The authors would like to extend their sincere appreciation to the anonymous reviewers for their insightful comments, and to Jeanine Raush, Ed.D., for her professional assistance in proofreading the draft of the manuscript.
Contributorship
Qiang Cheng conceptualized and designed the study with inputs from Shaoan Zhang and Jinkun Shen. Jinkun Shen and Qiang Cheng contributed to the data analysis and result reporting. The first draft of the manuscript was written by Qiang Cheng. Shaoan Zhang and Jinkun Shen provided comments on previous versions of the manuscript.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
