Abstract
Aggregated data meta-analyses indicate a correlation between instructional leadership and student achievement. However, it is unclear to what extent this relationship can be generalized across cultural contexts, as most primary studies stem from Anglophone regions. Drawing on international large-scale assessment data, this 3-level individual participant data (IPD) meta-analysis examines this relationship over a 6-year period using a sample of 1.5 million students in more than 50,000 schools from 75 countries. The findings show that the mean correlation is close to 0 and that the relationship between instructional leadership and student achievement varies significantly across contexts. This is mainly due to the level of human development and cultural factors. Implications for policy, practice, and education research are discussed.
However, context plays an important role in how leadership is defined and practiced in schools, as well as whether, how, and in what ways it becomes effective (Luschei & Jeong, 2021; Shaked, 2020; Tan, 2018). While research on educational leadership has identified a set of generic leadership styles and behaviors school leaders in various contexts fall back on, the enactment, mechanisms and results of educational leadership remain context-dependent (Walker & Ko, 2011). Consequently, educational leadership, its modus operandi, and its effects are shaped by culture, as well as by other national, regional, and local circumstances (Hallinger, 2018). Nevertheless, until now, “the context in which leadership is enacted has not received much attention” (Antonakis et al., 2004, p. 60) in empirical leadership studies and furthermore is clearly undertheorized, particularly regarding IL (Gümüş et al., 2021; Hallinger, 2018).
This is surprising, as scholars have long indicated that differences in context can be so great that entirely different concepts can be required to answer the same research question about leadership in different contexts (Rousseau & Fried, 2001). In the management and organizational literature, some use the terms “emic” and “etic” to describe the tension that arises from this (Den Hartog et al., 1999; He & van de Vijver, 2015; Klein et al., 2022), with “emic” relating to context-dependent uniqueness and “etic” referring to cross-contextual generalizability (Berry, 1989; Helfrich, 1999). As van de Vijver and He (2017) and others (F. M. Cheung et al., 2011; Schaffer & Riordan, 2003) argue, this dichotomy is nevertheless problematic for cross-cultural research, as it implies that cultural phenomena are either culture-specific or universal, consequently rendering it impossible to study their combination. Following this critique, in cross-cultural leadership research, “a combined approach taking into consideration both the methodological rigor and cultural sensitivity in a study may be preferred” (He & van de Vijver, 2015, p. 190). Thus, He and van de Vijver (2015) suggest three different routes to bridge the emic–etic divide: (a) combining emic and etic approaches, (b) minimizing method bias to reduce nuisance and increase equivalence between different contexts, and (c) employing multilevel analyses considering culture-level predictors.
Most studies to date empirically investigating IL (and its relation to student achievement) have originated from the Anglophone region (Wu & Shen, 2022). Quantitative cross-cultural comparative studies even with three or more countries are rare in educational leadership research, and there have been nearly no large-scale quantitative comparative studies on the topic of IL (Hallinger, 2018). However, a recent cross-cultural meta-analysis conducted by Karadag (2020) indicated that the majority of studies between 2008 and 2018 investigating the relationship between educational leadership and student achievement came from the Anglophone world. Approximately 95% of all research on the topic is from the “core Anglosphere” (Vucetic, 2020), a group of Anglophone countries representing about 5% of the global population in 2023. As a result, research on the IL–student achievement relationship is emic, as it examines leadership in schools connected to the Anglophone region (Schaffer & Riordan, 2003), which—as in other research areas (e.g., Arnett, 2008; Henrich et al., 2010; Thalmayer et al., 2021)—raises the question of how far scholars can generalize such findings on a global scale when they come “from a small and unrepresentative portion of the global population” (Thalmayer et al., 2021, p. 116).
Following the works of van de Vijver (He & van de Vijver, 2015; van de Vijver et al., 2010; van de Vijver & Leung, 2021), we therefore bridge the emic/etic tension in cross-cultural research by adopting a novel methodological approach to large data sets called split/analyze/meta-analyze (SAM; M. W. L. Cheung & Jak, 2016). This approach is comparable to a two-step individual participant data (IPD) meta-analysis (Brunner et al., 2022; Campos et al., 2023; Jak & Cheung, 2020), in which integrative, multilevel analyses follow a common analytical protocol across data sets (Brunner et al., 2022; L. Keller et al., 2021).
Using this approach, referred to by some as the “gold standard” of meta-analysis (Chalmers, 1993; Debray et al., 2015; Stewart & Tierney, 2002), we re-analyzed raw data from three cycles of the Organization for Economic Co-operation and Development’s (OECD’s) Program for International Student Assessment (PISA). Excluding countries with no leadership survey data, the sample included K1 = 73, K2 = 65, and K3 = 70 (for a total of K = 75) countries or regions; J1 = 18,162, J2 = 17,549, and J3 = 16,400 schools; and N1 = 515,958, N2 = 480,174, and N3 = 519,334 students across 2009, 2012, and 2015 cycles, respectively.
By this means, our study addresses the following two research questions:
Research Question 1: To what extent can the association between IL and student achievement be generalized across countries, cultures, and time?
Research Question 2: Which contextual macro factors moderate this association and thus limit its generalizability?
Literature Review
Origins of the IL Concept and Relevant Research
The concept of IL emerged in the United States in the 1930s (Hallinger et al., 2020) and has been one of the most popular research topics in the field of educational administration since the 1980s. From the outset, IL’s focus was on optimizing student learning by actively influencing teaching and instruction in such a way that “develops boys and girls in the full of their capacities in accordance with the rights and responsibilities of the American way of life” (Spears, 1941, p. 31). Consequently, as Gray (1934, p. 417) explained in one of the first scholarly works on the subject: “The unique function of instructional leadership is the improvement of teaching.”
In the early years, “instructional leader” was correspondingly used as a general term to refer to school principals who frequently engage in practices focusing on teaching and learning activities (Bridges, 1967; Brieve, 1972; Edmonds, 1979). While a range of practices related to IL was identified in these years, it was not until the 1980s that comprehensive frameworks were developed to delineate the boundaries of IL and clarify what exactly it entails. The concept’s increased popularity after the 1980s can be attributed at least partly to the emergence of more concrete conceptual models of IL (Bossert et al., 1982; Hallinger & Murphy, 1985; Murphy, 1990). Among others, Hallinger’s model, including its related measurement tool—the Principal Instructional Management Rating Scale (PIMRS)—has become predominant (Hallinger, 2011b). The model proposes three dimensions of IL: defining a school mission, managing the instructional program, and developing a positive school learning climate. While the first dimension includes framing and communicating the school goals, the second dimension covers a range of practices such as coordinating the curriculum, supervising and evaluating instruction, monitoring student progress, protecting instructional time, and providing incentives for teaching and learning. Finally, the last dimension involves promoting professional development among teachers and maintaining high visibility in the school.
Extant research on IL has paid great attention to its impact on various school characteristics and outcomes, such as teacher behaviors and development (e.g., efficacy, job satisfaction, commitment, and professional learning), school climate, school effectiveness, and instructional quality (Boyce & Bowers, 2018; Hallinger, 2005; Neumerski, 2013; Pietsch & Tulowitzki, 2017). This line of research has mostly confirmed the positive role of IL in the improvement of such characteristics and outcomes. As expected, the link between IL and student achievement has also received increasing attention in the literature, with encouraging positive findings from various contexts (Hallinger et al., 1996; Robinson et al., 2008; Shatzer et al., 2013). It should be mentioned, however, that most of these studies are based on a traditional conceptualization of IL which prioritizes the role of principals in the improvement of teaching and learning processes. Such studies are also largely based on single sources of data, typically principals or teachers, and cross-sectional designs (Hallinger, 2011b; Özdemir et al., 2022).
IL and Student Achievement
Both the actual conceptualization and the research on IL are deeply rooted in the 1970s effective schools movement (Sun & Leithwood, 2015), which centered on the assumption that improving a school’s quality has the power to increase student achievement (Teddlie & Reynolds, 2000). Seen from this angle, “an effective school is roughly the same as a ‘good’ school [and] performance of the school can be expressed as the output of the school, which in turn is measured in terms of the average achievement of the pupils” (Scheerens, 2000, p. 18). Hence, based on the initial finding that a school leader’s focus on teaching and instruction is strongly associated with improved learning outcomes for students in low-socioeconomic status (SES) schools (Edmonds, 1979), it is not surprising that the relationship between IL and student achievement has been central to the study of the determinants of effective schooling ever since (Hallinger, 2005; Neumerski, 2013).
While it was originally assumed that school leaders had a direct impact on student learning through hands-on leadership practices (Brookover et al., 1979; Rutter et al., 1979), it has become clear over time that such influences are more indirect and mediated by other within-school factors (Bossert et al., 1982; Heck et al., 1990; Scheerens, 2012). Existing research suggests that school leaders are most likely to affect student achievement through their influence on teacher beliefs, practices, and emotions, as well as on school structure and climate (Hallinger & Heck, 1998; Leithwood et al., 2019; Sebastian & Allensworth, 2012). In addition, the effects of leadership might be differentiated based on contextual factors, such as the socioeconomic background of students (Tan, 2018). However, only a few studies have utilized moderation or moderation–mediation analyses to investigate the conditional effects of IL on student achievement (Gümüş et al., 2022; Sebastian & Allensworth, 2019).
Although there are now various research approaches to explain how and to what extent leadership is related to student achievement (Leithwood et al., 2019; Seashore Louis et al., 2010; Witziers et al., 2003) and thus try to clarify the modus operandi, Hallinger’s (2011a, 2011b) reviews of decades of research in this area have concluded that the relationship between leadership and student achievement remains complex and ambiguous. It is clear that such a relationship cannot be fully explained by narrowly defined leadership approaches. Recent research, therefore, assumes that IL merges into leadership for learning (Boyce & Bowers, 2018), a blend of leadership practices that is not solely the responsibility of individuals in formal leadership roles (Daniëls et al., 2019), and that such blended leadership unfolds its impact as a school-wide practice of all those involved in the school (J. Ahn et al., 2021).
Although it is still largely unclear how IL and student achievement are related, various meta-analytical studies suggest that there is an observable and significant relationship between these two aspects. In this regard, it should be noted that all available meta-analyses are limited in several respects. First, no study has yet incorporated integrative (Curran & Hussong, 2009) or IPD meta-analysis (Debray et al., 2015), which allows for the direct meta-analysis of raw unit-level data (e.g., students, schools, and so on) from multiple studies and is clearly superior to simple pooling (Bravata & Olkin, 2001). Second, all available aggregated data meta-analyses include an excessive number of primary studies from the Anglophone world, especially the United States (Wu & Shen, 2022). Third, as the extant studies apply different methodologies and varying research protocols, there is a great amount of methodological heterogeneity between these studies, which can be considered an important bias factor (L. Keller et al., 2021). Consequently, Debray et al. (2015, p. 249) argue: “Unfortunately, when synthesizing published AD [aggregated data], even rigorously conducted meta-analyses can be of limited value.”
Taking these limitations into account, the relationship between school leadership and student achievement is generally rather moderate and, according to the results of a recent meta-meta-analysis by Wu and Shen (2022), should be around r = .17. Grissom et al. (2021), considering only studies with panel data, found in another meta-analysis that school leadership was associated with mathematics achievement at r = .06 and with reading achievement at r = .05. Scheerens (2012) found similarly small effect sizes with regard to direct (r = .05) and indirect effect (i.e., mediation, r = .06) models in his meta-analysis. In this respect, the effects differ significantly depending on the study design and are significantly smaller in longitudinal than in cross-sectional analyses. However, there seems to be hardly any difference between analyses that take mediator variables into account and those that do not.
In a recent cross-cultural meta-analysis of the relationship between school leadership and student achievement based on 151 articles and dissertations across nations, Karadag (2020) showed that IL has the highest effect (r = .34) on student achievement compared with distributed (r = .30) and transformational (r = .27) leadership models, as well as leadership practices (r = .27). Moreover, in their well-known meta-analysis of studies from the 1970s to the 2000s, Robinson et al. (2008) demonstrated a mean correlation between IL and student achievement of r = .21. Hattie (2017) identified a similar value in his research synthesis on correlates of student achievement. However, Tan et al. (2020) showed in a recent second-order meta-analysis that the correlation of IL with various school outcomes is r = .27, while the correlations are relatively higher for distributed (r = .30) and transformational leadership (r = .34).
By contrast, other recent meta-analyses suggest that these estimates may be inflated. Tan et al. (2021) found that a focus by school leaders on enhancing teaching and learning was associated with student achievement with an effect size of r = .11 and that effects further varied by achievement domain, with the effects of leadership being strongest in languages and mathematics but undetectable in science. A similar result was reached by Liebowitz and Porter (2019), who found a correlation of r = .11 between IL practices and student achievement in a meta-analysis of 51 studies from 2001 to 2019; they thus concluded that “previous literature may overstate the unique student achievement effects of principals’ time spent on and skill in instructional leadership behaviors” (p. 814).
The Influence of Context on IL and Student Achievement
School leadership and educational performance are context-dependent (Jacobson, 2011; Luschei & Jeong, 2021; Miller, 2018). Nonetheless, the concept of IL as an indicator of successful or effective schooling has spread around the globe, gaining momentum since the turn of the millennium (Hallinger et al., 2020; Reynolds et al., 2014). This is not least due to the fact that ILSA programs conducted around the world since the late 1990s, such as the Progress in International Reading Literacy Study (PIRLS) or PISA, have served as catalysts for the global transfer of educational concepts and policy solutions (Heffernan, 2018; Lewis & Lingard, 2015; Lingard et al., 2015). Thus, as Smith (2016, p. 6) notes, on the one hand, “the reinforcing nature of the global testing culture leads to an environment where testing becomes synonymous with accountability which becomes synonymous with education quality.” On the other hand, ILSAs have led to a proliferation of evidence-based education that, in the tradition of school effectiveness research, focuses on the question of what works (Ydesen & Andreasen, 2020), and thus also to the global transfer of educational concepts, including IL (Bush, 2022; Hallinger & Wang, 2015).
Concepts developed mostly in the Anglophone countries, however, may not be transferable to other parts of the world or, at least, might be “incomplete” (Bowers, 2020, p. 53) in other contexts. Culture matters for leadership in every regard (Dickson et al., 2012). Culturally endorsed implicit leadership theories include the cognitive schemata, prototypes, or societal or cultural leadership ideals people in different cultures hold about leaders and leadership; they shape what is understood by leadership, how and what people think about leadership, how leadership is enacted, and how leadership is examined in the context of scientific studies (Den Hartog et al., 1999; T. Keller, 1999; Shaw, 1990). Ultimately, this also affects the effectiveness of leadership, as with regard to cultural sensitivity “effective leadership is . . . in the eye of the beholder” (Judge et al., 2006, p. 203).
There are, however, recent efforts to turn around the concepts exported from the respective cultural and national contexts of the Anglophone world and adapt them to new educational traditions and circumstances (Fischman et al., 2019; Lewis & Lingard, 2015; Niemann et al., 2018). As for the concept of IL, a growing number of studies indicate that it is contextually adapted and shaped not only by culture but also by national, regional, and local circumstances (Bowers, 2020; Brauckmann et al., 2016; Gümüş et al., 2021; Shaked, 2020; Shaked et al., 2021). This contextual sensitivity is also observable in the relationship between IL and potential dependent variables, like student achievement. While studies from Anglophone countries suggest that there is a positive relationship between IL and student achievement, findings from other countries and cultures are sometimes contradictory. For instance, Mora-Ruano et al. (2021) re-analyzed PISA 2015 data from Germany and found negative associations between IL and student achievement. A meta-analysis by Witziers et al. (2003) confirmed that studies conducted in the Netherlands reported significantly lower effect sizes than studies conducted in the United States. Another recent meta-analysis highlighted the differentiated effects of school leadership in various cultures by showing “that the effect of educational leadership on students’ achievement is significantly higher in vertical-collectivist cultures (r = .52) than horizontal-individualistic cultures (r = .31), (Qb = 11.5, p < .05)” (Karadag, 2020, p. 59).
But what could be the reason for this? As Osborn et al. (2014) state, there are three strands of discussion on the relationship between leadership, leadership effects, and context: (a) leader-centric approaches, where context is seen as an influence on various criteria that can be influenced and changed by leadership; (b) situational approaches, which assume that leadership and its effects on dependent variables are contingent; and (c) embedded approaches, where “context is a dominant factor that establishes boundary conditions on the type of leadership displayed as well as the effectiveness of leadership” (Osborn et al., 2014, p. 593).
Cross-cultural research on IL suggests that the third approach may be instrumental in explaining variation in the relationship between IL and student achievement. For example, according to the OECD’s (2009a) Teaching and Learning International Survey (TALIS) findings, school leaders in countries such as Denmark, Iceland, Spain, and Belgium are much less likely to practice IL than their counterparts in Bulgaria, Turkey, Malaysia, or Mexico. Correlations with other school variables, in turn, varied greatly depending on the context: associations with teaching quality were found in only a few countries (e.g., Malta and Iceland), whereas associations with teacher professionalization and innovative teaching practices emerged in many. It is striking that the correlations are most evident in countries where IL is practiced frequently. The related research has also highlighted that emphasis on high-stakes tests at the national level is an important driver of school principals’ focus on improving student achievement (Ng et al., 2015; Walker et al., 2011). This indeed has the power to influence how and to what extent school principals enact various IL practices. In sum, both IL itself and its relationship with other variables, and thus eventually its effectiveness, depend on the context in which IL is enacted.
The reasons to which these differences might be attributed are numerous, although the subject is largely undertheorized (Clarke & O’Donoghue, 2017; Hallinger, 2018). One of the few exceptions is the model of Dimmock and Walker (1998, 2005), which is based on Hofstede et al.’s (2010) framework of cultural dimensions. Here, the authors assume that the cultural code—shared attitudes, values, goals, and practices, “from society, region and locality [and] organizational culture” (Dimmock & Walker, 2005, p. 24)—interacts with educational leadership, its modus operandi, and its effects. There is growing evidence showing that educational leadership and its effects may vary due to partially interacting local, regional, national, cultural, and economic factors (Bowers, 2020; Hallinger, 2018; Miller, 2018).
Furthermore, there is increasing empirical evidence showing that the interaction of educational leadership, student achievement, and the environment is closely interwoven in everyday life (Day et al., 2016; Forrester & Gunter, 2010). Hallinger (2018) posits that the economic, political, and socio-cultural context should be of general influence, while community, personal, and institutional context should have a direct influence on IL, its modus operandi, and its effectiveness. On the one hand, rather proximal factors, like the culture of a school (Sahin, 2011) and trust (Lubis et al., 2021), along with the socioeconomic environment of the school (Gümüş et al., 2022; Hallinger et al., 1996; Pietsch & Leist, 2019; Urick et al., 2022) play a major role in how IL is understood and enacted effectively. On the other hand, more distant factors, such as cultural values (Karadag, 2020) and educational policies (M. Lee & Hallinger, 2012), which are largely beyond the influence of local actors and require specific interpretation in concrete situations (Noman & Gurr, 2020), may also influence the effectiveness of IL.
This Study
With the literature in mind, the following desiderata can be identified: (a) It is well known that leadership and its effectiveness are context-dependent, with the cultural context, in particular, having a major influence; (b) a majority of the empirical studies on the relationship between IL and student achievement come from the Anglophone world, namely the “core Anglosphere”; (c) large-scale quantitative cross-cultural studies investigating the IL–student achievement relation are nearly absent; (d) the studies conducted so far show a high degree of methodological heterogeneity, which might have its own impact on the generalizability of findings; and (e) all previous best evidence studies (i.e., systematic reviews and aggregated data meta-analyses) are based on these primary studies.
It is therefore unclear to what extent the previous results can be generalized more broadly and are thus “etic” (Berry, 1969; Brett et al., 1997; Den Hartog et al., 1999), which “emic” contextual factors (can) modulate differences, and how robust the results reported in traditional meta-analyses are when both methodological heterogeneity is reduced and a variety of contexts are considered simultaneously. Hence, this study has two main goals. The first goal is to evaluate the “etic” generalizability of the IL–student achievement association across cultures and time (within cultures, or more precisely, countries). The second goal is to clarify which “emic” macro factors moderate this relationship and thus limit generalizability.
Methods
Data Source
Against this background, this study examines the association between IL and educational achievement in 75 countries and regions worldwide over the course of 6 years. In this regard, the study employs secondary data analysis of the OECD’s PISA 2009, 2012, and 2015. The main objective of PISA studies is to regularly collect and provide internationally comparable data on the performance of education systems, as measured by student achievement or literacy. The target group is 15-year-old secondary school students. The domains of mathematics, reading, and science are continually assessed with the help of competency tests. However, PISA also collects a wide range of other information on education-related topics. In 2009, 2012, and 2015, the international school questionnaires, completed by school leaders, also asked questions about leadership in schools.
Participating countries include OECD nations, as well as partner countries and regions. Typically, a minimum of 150 schools and 4,500 students per country are required; however, if a participating country had fewer than 150 schools, all schools participated. To ensure nationally representative samples for each participating country, PISA uses a two-stage sampling design, with students nested within schools. At stage 1, a random sample of schools is selected; at stage 2, 15-year-old students (35 students per school) are randomly selected within schools. A total of 18,162 schools with 515,958 students were included in the PISA 2009 data set, 17,549 schools with 480,174 students in PISA 2012, and 16,400 schools with 519,334 students in PISA 2015. Most nations participated in all 3 years, but some countries participated less frequently (e.g., Malta only in PISA 2009 and 2015, Panama only in PISA 2009, and Algeria only in PISA 2015). In addition, in rare cases, relevant variables were not collected; for example, in PISA 2009, the school questionnaire was not administered in France, so no data on leadership in French schools are available for that year. For an overview of which data from which countries were included in our analyses, see Supplemental Material in the online version of the journal.
Analytical Approach
Our study applies a novel meta-analytical big data approach suggested by M. W. L. Cheung and Jak (2016), called SAM. In this approach, a big data set is split into many smaller data sets, each of which is analyzed separately and then combined by using meta-analyses. This approach is comparable to a two-step IPD meta-analysis (Brunner et al., 2022; Campos et al., 2023; Jak & Cheung, 2020) by explicitly reducing method heterogeneity (or bias) between primary studies (Debray et al., 2015). Recently, Zhang et al. (2018) extended this approach by integrating a multilevel framework. Furthermore, Brunner et al. (2022), Campos et al. (2023), L. Keller et al. (2021), and Blömeke et al. (2021) adapted the approach to ILSAs and expanded it to an integrative data analysis, applying the same data analysis procedure for each subset of data.
In this context, one of the major advantages of using ILSA data for IPD meta-analyses is that these studies are based on large, representative probability samples, thus reducing sample selection bias and yielding very precise effect-size estimates (Brunner et al., 2022). Moreover, such an analysis can simultaneously overcome several limitations of classical single-stage meta-analyses with aggregated data, such as the non-consideration of differences in survey characteristics and aggregation bias, as well as the non-comparability of findings between studies (through the use of different instruments and methods; Campos et al., 2023).
We also followed this procedure, whereby we performed a content validation of the PISA leadership scales in advance to keep them as comparable as possible to the PIMRS. Figure 1 shows a simplified graphical scheme of our SAM approach.
Figure 1. Simplified scheme of the SAM procedure.
To address the concept of context, we followed the suggestions by S. Ahn et al. (2012) for evaluating effect heterogeneity in meta-analyses. We thus applied a systematic procedure to study the influence of different contextual characteristics on the variation of the identified correlations and thereby evaluate generalizability. Hence, in our understanding, context is anything that can threaten the generalizability of the IL–student achievement relation. In the final step, we evaluated the results of this theory-driven analysis by applying a fully data-driven machine-learning approach (Li et al., 2020), which enables identifying interactions between moderators in meta-analyses by using a classification and regression tree (CART) mechanism.
Computation of Leadership and Achievement Association
To compute the association between leadership and the school-level component of student achievement, the first step involved splitting the available three cycles of PISA data sets for each country, for each achievement domain, and for each plausible value. For example, the 2009 data set was first split by country variable, resulting in 73 different data sets; then, each country’s data set was further split by 3 achievement domains, resulting in 73 × 3 = 219 data sets. Finally, these 219 subsets were further split by plausible values, given that there were 5 values for PISA 2009; this final step resulted in 219 × 5 = 1,095 subsets. In other words, the data setup process was automated to split the PISA 2009 data set into 1,095 pieces. This was required to compute the relationship, as explained in the next paragraph.
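To illustrate the logic of the split step, a minimal R sketch is given below. The data frame and column names (pisa2009, cnt, schoolid, pv1math, and the "lead" prefix for leadership items) are illustrative placeholders of ours, not the code from the Supplemental Material.

```r
# Minimal sketch of the split step, assuming a merged student/school data
# frame `pisa2009` with a country code (cnt), a school ID, sampling weights,
# five plausible values per domain (pv1math, ..., pv5scie), and leadership
# items prefixed with "lead" -- all of these names are illustrative.
subsets <- list()
for (country in unique(pisa2009$cnt)) {
  for (domain in c("math", "read", "scie")) {
    for (pv in 1:5) {
      cols <- c("schoolid", "w_fstuwt", "w_fschwt",
                paste0("pv", pv, domain),
                grep("^lead", names(pisa2009), value = TRUE))
      key <- paste(country, domain, pv, sep = "_")
      subsets[[key]] <- pisa2009[pisa2009$cnt == country, cols]
    }
  }
}
length(subsets)  # 73 countries x 3 domains x 5 PVs = 1,095 subsets
```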
The first two steps of SAM were completed using an R code (see Supplemental Material in the online version of the journal) to automate the following: (a) separately for each cycle, the PISA data set (i.e., created by merging individual- and school-level PISA data) was split into subsets across countries, achievement domains, and further across each plausible value, resulting in a subset that had an achievement score, relevant sampling weights, and leadership scale item-level data; (b) for each subset, multilevel models using Mplus (Muthén & Muthén, 2021, see Supplemental Material in the online version of the journal) were employed to compute the association between the school-level component of the achievement and a school-level latent factor defined by leadership item scores; and finally, (c) the results were then collected and stored for the second stage.
The multilevel model mentioned in (b) utilized a multiple imputation approach to combine results obtained from the analysis of each plausible value (as recommended by Arıkan et al., 2020; OECD, 2009b), given that PISA recorded 5 plausible values per domain in the 2009 and 2012 cycles and 10 in 2015. In other words, following Campos et al. (2023), the imputation procedure at this point of the analysis was not utilized to address missing data but rather to combine results across plausible values. Further, in the multilevel model, both student and school weights (W_FSTUWT and, depending on the cycle, W_FSCHWT or W_SCHGRNRABWT) were taken into account; full information maximum likelihood was used to address missing data, as recommended by Hox et al. (2015); the student-level achievement variable was not declared as a within- or between-level variable, to activate the latent mean decomposition (Hamaker & Muthén, 2020); and finally, standardized coefficients were computed with the STDYX option to obtain the correlation between the school-level leadership latent factor and the school-level latent component of the achievement variable (Liu et al., 2014). In the end, a total of 597 correlation coefficients were computed separately for 75 countries across 3 cycles and 3 achievement domains.
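As a sketch of the pooling logic just described: estimates obtained separately for each plausible value can be combined via Rubin's rules. The function and argument names below are ours, not the study's code.

```r
# Hedged sketch: combine estimates computed separately for each plausible
# value (M = 5 in 2009/2012, M = 10 in 2015) via Rubin's rules.
pool_pv <- function(est, se) {
  M     <- length(est)
  q_bar <- mean(est)                # pooled point estimate
  w_bar <- mean(se^2)               # average within-imputation variance
  b     <- var(est)                 # between-imputation variance
  t_var <- w_bar + (1 + 1 / M) * b  # total variance
  c(estimate = q_bar, se = sqrt(t_var))
}
pool_pv(est = c(.21, .24, .19, .23, .22), se = rep(.05, 5))
```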
Meta-Analysis
In the third step of SAM, we analyzed those 597 correlation coefficients by applying a meta-analysis. Treating the number of participating schools as the corresponding sample size, the correlation between IL and achievement was transformed using Fisher’s r-to-z (Fisher, 1921) formula implemented in the metafor package (Viechtbauer, 2010). These effect-size measures and their heterogeneity were investigated in detail, similarly to the meta-analysis steps followed by Williams et al. (2022). Conditional models included both cycle-specific and country-specific moderators, as explained in the “Measures” section. The mixed-effects meta-regression models were estimated via the rma.mv function in metafor using restricted maximum likelihood, in which effect sizes specific to achievement domains (i.e., math, reading, or science) were allowed to nest in each cycle, and cycles were allowed to nest in countries to account for the dependencies. Using Konstantopoulos’s (2011) notation, a three-level empty meta-analytic model reads as follows:
$$z_{dck} = \gamma_{000} + u^{(3)}_{k} + u^{(2)}_{ck} + e_{dck},$$
where $z_{dck}$ denotes the Fisher-transformed correlation for achievement domain $d$ in cycle $c$ in country $k$; $\gamma_{000}$ is the average effect size; $u^{(3)}_{k} \sim N(0, \tau^{2}_{L3})$ is the country-level random effect; $u^{(2)}_{ck} \sim N(0, \tau^{2}_{L2})$ is the random effect for cycles within countries; and $e_{dck} \sim N(0, v_{dck})$ is the sampling error with known variance $v_{dck}$.
After each mixed-effects model, heterogeneity information was saved as follows: the estimate at the third level (i.e., countries), hereafter τL3; the estimate at the second level (i.e., cycles in countries), hereafter τL2; and a combined estimate $\tau = \sqrt{\tau^{2}_{L3} + \tau^{2}_{L2}}$.
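A minimal metafor sketch of this model follows; the column names (ri for the correlation, schools for the number of schools, country and cycle as grouping variables) are hypothetical.

```r
library(metafor)
# dat: one row per effect size (country x cycle x achievement domain).
dat <- escalc(measure = "ZCOR", ri = ri, ni = schools,
              data = dat)  # Fisher's r-to-z; sampling variance vi = 1/(n - 3)
res <- rma.mv(yi, vi,
              random = ~ 1 | country/cycle,  # cycles nested in countries
              method = "REML", data = dat)
summary(res)  # pooled effect plus variance components at both levels
```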
Multiple Moderator Information Search
To identify interactions between the moderators, we utilized Li et al.’s (2020) multiple-moderator meta-analytical CART method, employing the R package metacart, with the limitation that the nested structure of the data is ignored. This procedure searches for homogeneous subgroups within the given studies by combining moderators: “Unlike logistic and linear regression, CART does not develop a prediction equation. Instead, data are partitioned along the predictor axes into subsets with homogeneous values of the dependent variable—a process represented by a decision tree that can be used to make predictions from new observations” (Krzywinski & Altman, 2017, p. 757). As single regression trees are sensitive even to minor changes in the data, it is advisable to perform the corresponding analyses several times to achieve reliable results (Krzywinski & Altman, 2017). Accordingly, we ran Meta-CART analyses 100 times, once for each multiply imputed data set. A successful Meta-CART implementation is expected not only to reveal additional predictive information but also to serve as a sensitivity analysis.
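A single run of such an analysis might be sketched as follows. The moderator names are illustrative, and the pruning parameter c is an assumed setting, not the study's reported configuration.

```r
library(metacart)
# Hedged sketch: one random-effects meta-tree on a single imputed data set;
# the study repeated this 100 times, once per imputation.
tree <- REmrt(z ~ hdi + accountability + individualism + teacher_shortage,
              vi = vi, data = dat_imp, c = 1)  # c: pruning rule constant
summary(tree)  # identified subgroups and their pooled effect sizes
plot(tree)     # the fitted tree with parent and leaf nodes
```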
Measures
Student Achievement
PISA assesses achievement in the domains of reading, mathematics, and science. As each student takes only a subset of items from the bank of questions, the OECD reports multiple plausible values for each student in each domain, with the aim of approximating the true distribution of the latent achievement variable being measured. The data are available at the student level, and each domain is scaled to an international mean of 500 with a standard deviation of 100 test score points.
Instructional Leadership
School leadership was introduced into PISA for the first time in 2009 and is measured at the school level through self-reports from school leaders. The scales used in PISA are based on scales developed in another OECD study, the TALIS 2008 (OECD, 2010). In TALIS, the concept of school leadership was mainly framed around the IL model. Therefore, the IL model has also dominated leadership items across different cycles of PISA, while other popular leadership models, such as transformational and distributed leadership, were not systematically represented (see Supplemental Material in the online version of the journal). However, rather than using items directly from any of the existing IL measurement tools (e.g., PIMRS: Hallinger & Murphy, 1985), the OECD developed its own questionnaires to measure school leadership, with some items that might go beyond the IL approach. At present, the OECD has created leadership scales based on data from three PISA cycles (2009, 2012, and 2015). The relevant scales included 14 items in 2009 and 13 in 2012 and 2015 (OECD, 2012, 2014, 2017). However, only five items with exactly the same wording were used at all three measurement points. Moreover, from PISA 2009 to PISA 2012, the measurement was changed from a 4-point to a 6-point Likert-type scale.
Establishing comparability of the measured variables is critical for making reliable statements under the approach we have chosen. If different measured variables are used, however, invariance tests are not applicable, so it is necessary to establish the conceptual comparability of the measured variables (Campos et al., 2023). With this in mind, we chose to reconsider the relevant items to measure IL and to create new scales that come closer to the concept of the PIMRS across the three PISA cycles. We aimed to achieve content validity of the IL scales for the different cycles through expert judgment, following Lawshe’s (1975) method.
In the first step, we created an expert panel that included 10 international scholars from 5 different countries who had recently published on IL in the field’s core journals (e.g., Educational Administration Quarterly; Educational Management, Administration, and Leadership; Journal of Educational Administration; etc.). We sent all 21 items included in OECD’s leadership scales across three PISA cycles to the panelists and asked them to rate items as “essential,” “not essential but useful,” and “not necessary” against the PIMRS framework. After collecting the experts’ ratings, we kept 15 items, based on the assumption that items receiving the rating of “essential” from more than half of the panelists had some degree of content validity (Lawshe, 1975; see Supplemental Material in the online version of the journal). Next, we assigned each of the selected items to the three main dimensions of PIMRS, again based on the views of the panelists, to ensure that all dimensions were represented in each cycle (see Table 1).
Table 1. IL Items
Note. PISA = Program for International Student Assessment.
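Lawshe's retention criterion can be made concrete with his content validity ratio (CVR); the following one-liner is our illustration, not part of the study's materials.

```r
# Lawshe's (1975) content validity ratio for a single item: n_e is the number
# of panelists rating the item "essential," N the panel size; CVR > 0 means
# more than half of the panel rated it "essential."
cvr <- function(n_e, N) (n_e - N / 2) / (N / 2)
cvr(n_e = 7, N = 10)  # 0.4 -> item retained under the majority criterion
```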
In the next step, we ran bifactor models to investigate the unidimensionality of the IL measure. The bifactor (S − 1) model (Eid et al., 2017; Pokropek et al., 2022), treating IL as the general factor with Defining the School Mission as the reference facet (as it was the most conceptually stable dimension across all three PISA cycles) and the two remaining sub-dimensions (Managing the Instructional Program and Developing the School Learning Climate) as specific factors, was tested across each country and each cycle to compute the explained common variance (ECV), hierarchical omega (OmegaH), and percentage of uncontaminated correlations (PUC) to examine unidimensionality, as suggested by Dueber and Toland (2023). If PUC ≥ 0.80, modeling unidimensionality is supported; if PUC < 0.80, then ECV should be ≥0.60 and OmegaH ≥ 0.70 for the overall score to support unidimensionality (Reise et al., 2013). The analyses were conducted with the R package BifactorIndicesCalculator (Dueber, 2021), and the model successfully converged (i.e., OmegaH < 1) in 76% of the cases, resulting in median values of 0.87, 0.83, and 0.89 for ECV, OmegaH, and PUC, respectively. These high bifactor indices, along with the presence of convergence issues (which in itself suggests that even modeling sub-scores is problematic), are considered evidence for a defensible use of the total IL score.
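A hedged lavaan sketch of such a bifactor (S − 1) model for one country-cycle subset is shown below; the item names (il1 to il8) and their assignment to the specific factors are illustrative placeholders, not the study's actual item mapping.

```r
library(lavaan)
library(BifactorIndicesCalculator)
# IL is the general factor; the Defining the School Mission items (il1-il3)
# load on it alone and serve as the reference facet of the (S-1) structure.
model <- '
  IL  =~ il1 + il2 + il3 + il4 + il5 + il6 + il7 + il8
  MIP =~ il4 + il5 + il6   # Managing the Instructional Program
  DSC =~ il7 + il8         # Developing the School Learning Climate
'
fit <- cfa(model, data = dat, orthogonal = TRUE, missing = "fiml")
bifactorIndices(fit)  # returns ECV, OmegaH, and PUC, among other indices
```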
Consequently, in the final step, we conducted confirmatory factor analyses (CFA) to assess the fit of the unidimensional latent construct, named IL, across each country and each cycle. The IL factor was defined by 10 items in 2009 and 8 items in 2012 and 2015. The CFAs were conducted with the R packages lavaan (Rosseel, 2012) and semTools (Jorgensen et al., 2021) across countries and cycles. To address missing data in these analyses, full information maximum likelihood was utilized (see Supplemental Material in the online version of the journal). The items used, along with their assignment to the PIMRS dimensions, can be found in Table 1 (see Supplemental Material in the online version of the journal for reliability and model fit statistics).
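For these unidimensional models, a per-subset sketch (again with illustrative item names) could look like this; the SRMR and omega extracted here later serve as the methods moderators described in the next section.

```r
library(lavaan)
library(semTools)
# Hedged sketch: unidimensional CFA for one country-cycle subset.
model <- 'IL =~ il1 + il2 + il3 + il4 + il5 + il6 + il7 + il8'
fit <- cfa(model, data = dat, missing = "fiml")  # FIML for missing responses
fitMeasures(fit, "srmr")    # model fit statistic
semTools::reliability(fit)  # includes McDonald's omega
```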
Moderators
As most studies on the associations between IL and student achievement have been conducted in the Anglophone world and, moreover, only a few studies investigate several measurement points, we examined the generalizability of the findings in a further step of our analysis. Here we followed the suggestions by Cronbach (1982) and applied the UTOS framework (units, treatments, observing operations, and settings), extended by a methods (M) component to MUTOS, to organize potential moderators.
Since the psychometric characteristics of the IL measures might be affected by context (Eryilmaz & Sandoval Hernandez, 2021) and might also interact with other model variables in the sense of a method factor (Becker, 2017), we include the standardized versions of the following two statistics from the country- and cycle-specific measurement models as methods (M) moderators.
SRMR (M)
McDonald’s Omega (M)
Regarding the Units component of MUTOS, we included the following characteristics of the samples (for the rationales and analytical details, see the Appendix).
Student–teacher ratio (U)
Reported in the available school-level PISA data, the student–teacher ratio (STRATIO) was aggregated for each country across cycles. In contrast to the class size variable (CLSIZE), which is essentially a discrete variable collected as part of the school/principal questionnaire in PISA, the STRATIO variable is metric and based on direct information—that is, the number of students enrolled and the number of teachers on staff (Giambona & Porcu, 2018). The country average for this moderator ranged between 7.33 and 31.1, with a mean of 13.9, and was distributed approximately normally, with a median of 12.9.
Teacher shortage (U)
Computed similarly, the country average for teacher shortage (TCSHORT) was also distributed approximately normally, with mean and median values of 0 and ranging between −1.12 and 2.13.
Accountability pressure (U)
This variable is also available in the school-level PISA data (SC22Q01, SC19Q01, and SC036Q01TA) as a yes/no question. Originally, these variables were coded as 1 for yes and 2 for no; we subtracted 1, hence coding 0 for yes and 1 for no. Average accountability scores were then computed for each country and each cycle. These scores were distributed with a slight negative skew and ranged between 0.07 and 0.99, with mean and median values of 0.6, where a high score indicates that achievement data are less frequently made public.
Student SES (U)
Available in the student-level PISA data, the country-level averages of the student SES (i.e., ESCS in PISA) showed an approximately normal distribution, with −1.84 and 0.78 as the minimum and maximum values and −0.26 and −0.15 as the mean and median values.
Student sample size (U)
The PISA technical reports include country-level overall student sample sizes across cycles, ranging between 293 and 38,250. The values showed a positively skewed distribution, with a mean of 7,049 and median of 5,366.
School size (U)
Similar to student sample size, the technical reports provide the average within-school sample size for each country. These values ranged between 16.8 and 132, with a mean of 32.4 and a median of 27.8.
Share of private schools (U)
The technical reports also provided the share of private schools, which was positively skewed, with scores ranging between 0 and 93.3, a mean of 17.6, and a median of 9.36.
Several studies have shown that both student achievement and school leadership may depend on the level of human development (e.g., Crede et al., 2019; Eriksson et al., 2021; Ramírez, 2006); macroeconomic factors such as gross domestic income (e.g., Hanushek & Woessmann, 2017; He et al., 2017; M. Lee & Hallinger, 2012; Stoet & Geary, 2013), wealth inequality within countries (e.g., Condron, 2011; M. Lee & Hallinger, 2012; Stoet & Geary, 2013; Thorson & Gearhart, 2018; Urick et al., 2022; van Ewijk & Sleegers, 2010), and government expenditure on education (e.g., Agasisti, 2014; Gupta et al., 2002); as well as cultural factors (e.g., Fang et al., 2013; Hanushek et al., 2020; He et al., 2017; M. Lee & Hallinger, 2012). Therefore, we consider the following settings (S) moderators.
Human development index (HDI) (S)
Gross domestic product (GDP) (S)
GINI index (S)
Government expenditure on education as a percentage of GDP (GEX) (S)
Hofstede’s cultural dimensions (S)
Since we did not examine an intervention in the sense of Cronbach, we did not consider treatment (T) moderators; the observing operations (O) component was represented by the three achievement domains, which entered the model as nested effect sizes rather than as moderators.
Results
Overall Relation and Heterogeneity Between IL and Student Achievement
To answer our first research question, we conducted the SAM analysis. In the analysis phase, measurement models of IL were first estimated for each country and cycle (see Supplemental Material in the online version of the journal). The fit statistic SRMR ranged from 0.031 to 0.162; for example, particularly good fit values were found in Vietnam, Greece, and Germany, and particularly poor fit values in Malta, Jordan, and Montenegro. Regarding the internal consistency of the IL scale, the omega values ranged from 0.656 to 0.944. These also varied between nations; for example, the values were particularly high at 0.944 and 0.920 in Romania and Thailand and particularly low at 0.656 and 0.699 in Jordan and Malta.
Second, we estimated correlations between IL and student achievement at the school level for each achievement domain nested in each cycle for each country. The correlations between IL and student achievement were positive on average in the first PISA cycle in 2009 (r = .033) and negative on average in the two subsequent cycles (r = −.019 and −.012). Furthermore, large differences in correlation coefficients were observed both between and within countries. The largest positive correlation coefficient was calculated for Luxembourg in the 2015 data set between math and IL (r = .436), but this correlation was lower in the 2009 data set (r = .239). The largest negative correlation was computed for Liechtenstein in the 2012 data set between reading and IL (r = −.478); however, the correlation was nearly 0 (r = .031) for Liechtenstein’s relation between science and IL in 2009. Overall, relatively high positive correlations (r = .10–.30) were observed for a few countries (e.g., Luxembourg and Peru) across 3 cycles, while the correlations were always negative or very close to 0 for some others (e.g., Australia and Belgium).
In the next phase, we conducted an integrative data analysis by applying a multilevel meta-analysis. Using the number of schools as the corresponding sample size, the correlations were transformed using Fisher’s (1921) r-to-z formula. The unconditional three-level model resulted in an average effect size of 0.005, with a lower boundary of −0.021 and an upper boundary of 0.030. The τL3 and τL2 were estimated as 0.082 and 0.116, respectively, hence τ = 0.142. The test for heterogeneity resulted in a Q value of 3,448.416 (df = 596, p < .001). To increase the readability of the forest plots (Kossmeier et al., 2020), domain-specific unconditional models were tested; the average effect size was −0.002 [−0.027, 0.023] for the association between IL and math achievement, 0.011 [−0.015, 0.036] for reading achievement, and 0.001 [−0.025, 0.027] for science achievement, as depicted in Figures 2 to 4. Overall, the heterogeneity was due to differences in effect sizes across cycles and countries but not due to achievement domains; in other words, the relation between IL and achievement was consistent across domains but varied from year to year and among countries. The heterogeneity can be seen in Figure 5, which depicts all 597 correlation coefficients and indicates no bias pattern.
Figure 2. Forest plot for the math and IL association.
Figure 3. Forest plot for the reading and IL association.
Figure 4. Forest plot for the science and IL association.
Figure 5. Funnel plot of the 597 studies included in the IPD meta-analysis.
Explanation of Heterogeneity Through Culture and Other Contextual Characteristics
We addressed our second research question by integrating MUTOS moderators into our three-level meta-analysis model. The first conditional model included moderators from subset M: the SRMR and omega values from the CFAs of the IL scale across each cycle were standardized and entered into the model. There were no missing data for the methods moderators. The overall effect-size estimate was 0.005 (SE = 0.013, df = 71.7, pS = .733). Compared with the unconditional model, less than 1% of the heterogeneity was explained, τL3 = 0.085, τL2 = 0.114, and τ = 0.142. The test for residual heterogeneity resulted in a QE value of 3,423.027 (df = 594, p < .001). The results showed that neither of the moderators significantly explained the heterogeneity. As reported in Table 2, neither the coefficient for the scale reliability (βomega = −0.016, SE = 0.012, df = 30, pS = .169) nor that for the fit index (βSRMR = −0.003, SE = 0.013, df = 20.9, pS = .836) was significantly different from 0.
Table 2. Meta-Regression Results for Methods Moderators
Note. ps = Satterthwaite adjusted p-values; SRMR = standardized root-mean-squared residual.
The second conditional model included moderators from subsets M and U; in addition to omega and SRMR, the student–teacher ratio, teacher shortage, accountability pressure, student SES, student sample size, school size, and share of private schools were standardized and entered into the model. Although missing data for the U moderators occurred at a negligible rate, multiple imputation was performed to prevent loss of information. The combined results indicated that, when all moderators were held constant at their average (i.e., a value of 0 after standardization), the overall effect-size estimate was 0.003 (SE = 0.011, df = 61.7, pS = .806). Compared with the unconditional model, 22% of the heterogeneity was explained, τL3 = 0.046, τL2 = 0.116, and τ = 0.125. The average QE was 2,433.787. Table 3 reports the meta-regression results for the M and U moderators, among which only three were flagged as statistically significant at the .05 level: the scale reliability (βomega = −0.023, SE = 0.010, df = 28.11, pS = .031), accountability pressure (βaccountability = 0.031, SE = 0.011, df = 34.35, pS = .011), and SES (βSES = −0.030, SE = 0.012, df = 18.70, pS = .025).
Table 3. Meta-Regression Results for Methods and Units Moderators
Note. ps = Satterthwaite adjusted p-values; SRMR = standardized root-mean-squared residual; SES = socioeconomic status.
The third conditional model included moderators from subsets M and S; in addition to omega and SRMR, the GINI index, GEX, HDI, GDP, and the Hofstede measures were standardized and entered into the model. The missing data rate for the S moderators reached 25%. The combined multiple imputation results indicated that, when all moderators were held constant at their average, the overall effect-size estimate was −0.005 (SE = 0.012, df = 47.57, pS = .679). Compared with the unconditional model, 20% of the heterogeneity was explained, τL3 = 0.061, τL2 = 0.111, and τ = 0.127. The average QE was 2,339.796. As reported in Table 4, none of the moderators was flagged as significant except the scale reliability (βomega = −0.026, SE = 0.012, df = 25.73, pS = .045).
Table 4. Meta-Regression Results for Methods and Settings Moderators
Note. ps = Satterthwaite adjusted p-values; SRMR = standardized root-mean-squared residual; GEX = government expenditure on education, total (% of GDP); HDI = human development index; GDP = gross domestic product.
The final conditional model for the MUTOS organizing process included important moderators from prior models. We used relatively liberal selection criteria; thus, the following moderators with an adjusted p-value smaller than .20 were included in the final model: omega; student–teacher ratio; teacher shortage; accountability pressure; SES; school size; HDI; GEX; and Hofstede measures of individualism, masculinity, and indulgence. Compared with the unconditional model, the final model explained 31.3% of the heterogeneity, τL3 = 0.028, τL2 = 0.114, and τ = 0.118. The average QE was 2,144.674. Table 5 reports results from the final model; accountability pressure (βaccountability = 0.033, SE = 0.010, df = 33.05, pS = .003), HDI (βHDI = −0.035, SE = 0.016, df = 13.72, pS = .046), and masculinity (βHof_mas = −0.025, SE = 0.011, df = 12.62, pS = .041) were significant at p < .05.
Table 5. Meta-Regression Results for the Final Model
Note. ps = Satterthwaite adjusted p-values. SES = socioeconomic status; HDI = human development index; GEX = Government expenditure on education, total (% of GDP).
Robustness Check With Meta-CART
Finally, we compared the theory-driven MUTOS approach with a data-driven regression-tree analysis by using Meta-CART. A total of 12 predictors entered the random-effects model: 11 predictors from the final model and the year as the cycle indicator. For the 100 imputed data sets, 100 Meta-CART analyses were completed, and each predictor was listed at least 32 times as an important moderator (see Table 6). Among the moderators listed as important, HDI, accountability pressure, individualism, and teacher shortage appeared in 98, 96, 86, and 82 of the 100 analyses, respectively. Specifically, either HDI or individualism was listed as the first parent leaf in all 100 analyses, indicating that human development and cultural aspects are the most relevant moderators with regard to the IL–student achievement relation. Even though individualism was not listed as a significant moderator in the MUTOS process, it appeared 74 times as the first or second parent leaf in the Meta-CART analyses, indicating that it formed interactions with other variables. For example, as depicted in Figure 6, the Meta-CART analysis of one imputed data set listed HDI as the first parent leaf; as it interacted with individualism, lower individualism values along with lower HDI values resulted in two different subgroups, both with an average effect size larger than 0 (g = 0.203 and g = 0.133), with Hedges’s g values of less than 0.20 indicating small effects, values around 0.50 indicating moderate effects, and values of about 0.80 indicating large effects (Cohen, 2013).
Table 6. Important Moderators and Their Appearance in 100 Meta-CART Analyses
Note. CART = Classification and Regression Tree; SES = socioeconomic status; HDI = human development index; GEX = Government expenditure on education, total (% of GDP).
Figure 6. An example plot and Hedges’s g values for one of the Meta-CART analyses.
Discussion
Through this study, we investigated the relationship between IL and student achievement in 75 countries across different cultures and a 6-year time span. Based upon data from more than 1.5 million students nested in more than 50,000 schools across the globe, we demonstrated with the aid of a novel meta-analytical big data analysis approach that the relation between IL and student achievement in the domains of reading, mathematics, and science, with a mean value of 0.005, is significantly lower than the effect sizes estimated in traditional meta-analyses. Furthermore, our analyses demonstrate that the relation between IL and student achievement is context-dependent, as the effect sizes ranged from −0.48 to 0.44 for different countries in different cycles. Accordingly, the findings on the investigated relationship cannot be generalized in principle, as they are not etic. It is nonetheless striking that the heterogeneity in the data can only be attributed to a very small extent to achievement domains. On the one hand, this means that our study did not confirm findings from other studies that reported varying correlations in this regard (Tan et al., 2021). On the other hand, the time of measurement and/or the time of data collection, along with the location of data collection, have a major influence on the correlation between IL and student achievement, indicating that, even within countries, the findings on the relationship between these two variables are hardly stable over time.
The most important finding from the first part of the analysis is a relatively weak average relationship between IL and student achievement compared with previous meta-analyses. First, previous aggregated data meta-analyses were mostly limited to studies from the core Anglosphere, while our analysis covers a large number of countries with various educational and socioeconomic contexts; as a result, the partly conflicting findings from different contexts largely canceled each other out. Another reason could be the differing understanding of student achievement in most of the existing studies compared with PISA. While student achievement is often measured via school grades or curriculum-based national exams, “PISA examines how well students are prepared to meet the challenges of the future, rather than how well they master particular curricula” (OECD, 2014, p. 3). Therefore, school principals’ IL practices may not be directly transferable to students’ PISA performance. Future research, therefore, might investigate the nature of IL and its relation to student achievement under these differing conceptualizations of achievement.
By applying the MUTOS approach (an extended form of UTOS for meta-analyses), meta-regressions revealed which country-specific contextual factors threaten the generalizability of the IL–student achievement relation. The key findings are that (a) considerable variability stems from the level of human development in a country; (b) cultural factors have a significant influence; and (c) accountability pressure is inversely related to the coupling of IL and student achievement, meaning that, in school systems with a strong emphasis on publicly reporting the outcomes of schooling, the association becomes looser. A fully data-driven machine-learning regression-tree analysis (CART) supported these findings and demonstrated that the level of human development and cultural factors are the main factors limiting the possibility of generalization. This means that all other potentially influencing variables are subordinate to, or interact with, these two factors; that is, their influence on the relationship between IL and student achievement depends on them. However, the motives and mechanisms behind the roles of human development and cultural factors deserve specific attention and more in-depth analyses, and we therefore suggest that scholars pay greater attention to these aspects in future research.
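As a schematic illustration of a moderator analysis of this kind, the following Python sketch runs an inverse-variance-weighted meta-regression of simulated effect sizes on two of the moderators named above. It is a deliberately simplified stand-in: the study estimated a three-level random-effects model, whereas this sketch uses fixed-effects WLS on simulated data.

```python
# A schematic inverse-variance-weighted meta-regression. NOTE: fixed-effects
# WLS on simulated data, shown only to illustrate the moderator logic; the
# study itself used a three-level random-effects model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
k = 150                                     # simulated effect sizes
hdi = rng.uniform(0.4, 1.0, k)              # level of human development
acc = rng.uniform(0.0, 1.0, k)              # accountability pressure
v = rng.uniform(0.002, 0.02, k)             # sampling variances
# Simulated pattern mirroring findings (a) and (c): the IL-achievement
# coupling loosens as HDI and accountability pressure rise.
g = 0.35 - 0.30 * hdi - 0.10 * acc + rng.normal(0, np.sqrt(v))

X = sm.add_constant(np.column_stack([hdi, acc]))
fit = sm.WLS(g, X, weights=1.0 / v).fit()
print(fit.params)                           # expect negative moderator slopes
```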
However, it is also striking that the results within countries were hardly stable over time, so that the overall heterogeneity was dominated by within-country differences rather than between-country variation. Furthermore, the within- and between-country variance estimates were significantly higher than those reported in other ILSA-based IPD meta-analyses examining individual-level effects, such as the relationship between individual students’ SES and student achievement (Brunner et al., 2022). This suggests that the relationship between IL and student achievement at the school level is context-sensitive rather than a universal phenomenon, and that the results also depend on the samples, instruments, and measures used. This result is all the more surprising given that (a) the sampling design of PISA should reduce sample selection bias, (b) the large sample size allows for very precise effect-size estimates, and (c) we controlled for possible unreliability of the measures and for contextual factors in the analyses. Therefore, in line with Campos et al. (2023), we argue that future studies should investigate in more detail the effects (or biases) in IPD meta-analyses arising from heterogeneity between samples, the types and characteristics of the measures, and the sampling procedures involved.
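For readers less familiar with this variance decomposition, a standard three-level random-effects specification (the notation here is ours, not necessarily that of the original model) separates the two heterogeneity components:

\[
g_{jk} = \mu + u_k + u_{jk} + e_{jk}, \qquad u_k \sim N(0, \tau^2_{\text{between}}), \quad u_{jk} \sim N(0, \tau^2_{\text{within}}), \quad e_{jk} \sim N(0, v_{jk}),
\]

where \(g_{jk}\) is the effect size observed in cycle \(j\) of country \(k\), \(\tau^2_{\text{between}}\) captures between-country variation, \(\tau^2_{\text{within}}\) captures variation across cycles within a country, and \(v_{jk}\) is the known sampling variance. The finding reported above corresponds to the estimate of \(\tau^2_{\text{within}}\) dominating that of \(\tau^2_{\text{between}}\).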
Limitations and Areas for Future Research
Many of the key limitations of our study stem from the data we used. First, PISA data are cross-sectional, preventing us from drawing causal conclusions. Second, PISA focuses on the OECD and associated countries. The data therefore do not permit generalizable statements on a global level, as large parts of Africa, for example, do not participate in the study (van de Vijver et al., 2010). Third, because of the school sampling in PISA, data in small countries were collected repeatedly in the same schools during all three cycles, whereas in nations with many schools, data were rarely collected in the same schools over time. Thus, quasi-longitudinal data exist at the school level in some countries but not in others, which may affect the stability of effects over time; unfortunately, we could not control for this with the available data. Fourth, although we made great efforts to validate the items and scales for IL, we were able to handle method heterogeneity only to a limited extent because the OECD changed items and scales between cycles. The use of raters was therefore the only effective way to ensure conceptual comparability of the scales between countries and cycles; however, whenever possible, statistical comparability should be tested (Campos et al., 2023). This also applies to PISA’s (original) school leadership scale, as it is already evident that a high level of invariance between countries is usually not achieved within individual PISA cycles, and consequently “comparisons made across countries using this scale should be made with caution” (Eryilmaz & Sandoval Hernandez, 2021, p. 17). Fifth, the information on IL comes from school leaders’ self-reports, which may limit the validity of the data. Sixth, this study focuses on only one aspect of school leadership, IL, as the existing PISA cycles did not contain enough items to form scales for other common approaches to school leadership.
In addition, other methodological challenges are important to note. The first was missing data at the country level, with missing rates for the S moderators reaching up to 25%. We used joint modeling and 100 imputations to address the missing data, as Leite et al. (2021) showed that joint modeling performs satisfactorily in detecting a treatment effect with 10 predictors, a 30% missing data rate, and only 10 multiple imputations. The second challenge was implementing Meta-CART for a nested data structure, which we could only partially address by introducing cycle as a moderator. The Meta-CART analyses produced findings consistent with the MUTOS results, as HDI and accountability pressure were listed as relevant moderators. However, Meta-CART revealed that Hofstede’s cultural measures might form interactions that were not evident in the MUTOS results. We share the relevant code, data sets, and additional graphs (see Supplemental Material in the online version of the journal) to help others further explore the heterogeneity in the relation between IL and school-level components of student achievement.
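To illustrate the impute-analyze-pool workflow referred to here (not the joint modeling implementation actually used), the following Python sketch imputes an artificial country-level moderator matrix 100 times and pools a toy estimate via Rubin’s rules; scikit-learn’s IterativeImputer (chained equations) serves as an accessible stand-in for joint modeling, and all data are simulated.

```python
# A minimal sketch of multiple imputation with pooling via Rubin's rules.
# NOTE: the study used joint modeling; IterativeImputer (chained equations)
# is a stand-in, and the 75 x 5 moderator matrix below is simulated.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(7)
X = rng.normal(size=(75, 5))                # 75 countries, 5 S-level moderators
X[rng.random(X.shape) < 0.25] = np.nan      # ~25% missing at the country level

M = 100                                     # number of imputations, as in the study
estimates, variances = [], []
for m in range(M):
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    Xm = imputer.fit_transform(X)
    col = Xm[:, 0]                          # toy analysis: mean of one moderator
    estimates.append(col.mean())
    variances.append(col.var(ddof=1) / len(col))

# Rubin's rules: total variance = within + (1 + 1/M) * between.
q_bar = np.mean(estimates)
w_bar = np.mean(variances)
b = np.var(estimates, ddof=1)
total_se = np.sqrt(w_bar + (1 + 1 / M) * b)
print(q_bar, total_se)
```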
Given these limitations and considering the results of our study, we recommend that differences between countries, as well as between measurement occasions within countries, be considered when surveying and analyzing the relationship between IL and student achievement. We also recommend paying more attention to method heterogeneity in the context of traditional meta-analyses, following, for example, the approach of Schmidt and Hunter (2015). This is because, according to our results, findings from individual studies do not seem to be generalizable either within a country or between countries, not even when comparable analysis protocols and methods are used. In light of this, the question arises as to how the significance and usability of ILSAs for cross-cultural research can be strengthened through the stronger inclusion of emic factors. In this context, one may heed the suggestion of F. M. Cheung et al. (2011) to consider emic and etic aspects already when developing IL measures. This is also the first route mentioned by He and van de Vijver (2015), and the only one we were unable to take with the available data. For instance, in cross-cultural multi-matrix designs (e.g., Gonzalez & Rutkowski, 2010; Kaplan & Su, 2016), etic items could be used to measure the construct in a generalizable way, while, depending on the context, emic items could be added to capture the respective cultural uniqueness.
Considering that, according to our data, the level of human development is an important determinant of the relationship between IL and student achievement, it is important to investigate how this plays out in countries that do not fall within the OECD framework (van de Vijver et al., 2010), that is, countries that are less developed and have a low HDI. In this regard, studies show that aspects of human development and culture are partially interconnected and interdependent (Gamlath, 2017; Pereira Sartori Falguera et al., 2021), a finding that our Meta-CART analyses also demonstrate in terms of joint moderating effects in school settings. As in other meta-analyses (Crede et al., 2019), leadership effectiveness was inversely related to human development in our study: the higher the HDI, the weaker the relationship between IL and student achievement. Since the concept of human development focuses on two simultaneous processes, namely, (a) the building of human capabilities and (b) the ways people use these capabilities to function in society and freely choose between different options (Dasic et al., 2020), this suggests that IL may be particularly effective in contexts where people have fewer “real freedoms” (Sen, 2001, p. 36) to achieve actual functionings, that is, beings and doings (Sen, 1995). However, since learning can itself increase such freedoms (Nussbaum, 2006) by promoting “reason, personal autonomy, and independence” (Flores-Crespo, 2007, p. 54), one might assume that IL could be a tool for enhancing human capabilities and that its effectiveness diminishes with increasing human development. Clarifying whether this is the case could be another important research topic.
Since we considered only correlations in this study, whereas IL can be assumed to affect student achievement indirectly, our finding of an overall correlation close to zero indicates only the absence of a direct linear relation. It would therefore be worthwhile to conduct further comparable studies using more complex meta-analytical structural equation models (Jak & Cheung, 2020). The challenge will be to systematically include theoretically well-justified mediator variables in such a model. Scheerens (2012) shows that the variables included in such models so far have been very heterogeneous. Consequently, the effect sizes in direct and indirect models hardly differ from each other in his meta-analysis, leading him to argue: “A key question that indirect effect models address is the question whether a stable set of intermediary school conditions can be identified” (Scheerens, 2012, p. 133). In our view, it is particularly useful to incorporate variables on classroom instruction/teaching, as this is by definition the main object of IL, and data on the subject have been collected in almost all PISA cycles (OECD, 2014); surprisingly, though, this variable has rarely been used in indirect effect models so far (Scheerens, 2012).
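In such an indirect-effect model, with IL denoted \(L\), an illustrative mediator \(M\) (e.g., instructional quality; the choice of mediator is ours, for illustration), and achievement \(Y\), the standard mediation decomposition would be

\[
M = aL + e_M, \qquad Y = c'L + bM + e_Y,
\]

so that the indirect effect is \(ab\) and the total effect is \(c' + ab\). A total association near zero, as found here, is thus compatible with a nonzero indirect path \(ab\) that is offset by the direct path \(c'\).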
In addition, it should be noted that the PISA sample includes only secondary schools, whereas the available aggregated data meta-analyses on this topic are mostly based on primary studies conducted in elementary schools (see, e.g., Robinson et al., 2008); this difference may also explain the diverging results. As the results of our ILSA-based IPD meta-analysis differ from those of traditional aggregated data meta-analyses, we further propose integrating both types of data in a two-stage meta-analysis, with random effects varying between effect sizes, to explore the sensitivity of the data and measurement between IPD and aggregated data meta-analysis approaches (Campos et al., 2023; Scherer et al., 2021).
Finally, in light of the findings presented, it seems important to also treat the measurement occasion, and especially time, as a context variable in further studies.
Conclusion and Implications
In conclusion, our study makes clear that previously reported findings on the relationship between IL and student achievement should be treated with caution, as it is likely that the results are not generalizable on a global scale. IL and student achievement are not positively correlated per se; their relationship varies greatly depending on the context and on whether a school leader’s focus on teaching and instruction resonates in a given setting. Consequently, the implications for policy and practice are manifold. On the one hand, policy borrowing is not advisable; after all, what works in one context does not necessarily work in another. Since we found negative correlations in some cases, we cannot rule out the possibility that IL may even have counterproductive effects in certain contexts (although the cross-sectional PISA data do not allow us to make such statements). On the other hand, in practice, our findings do not mean that it is useless for school leaders to focus on the core business of school, namely instruction; at the same time, IL should not be expected to be a magic bullet for student achievement. Rather, school leaders should continually evaluate whether IL actually achieves the intended goals under the respective contextual conditions and adjust their leadership practices accordingly.
Our study is one of the few large-scale quantitative cross-cultural studies of educational leadership to date. Accordingly, we present findings on the relationship between IL and student achievement for the first time for many countries and regions, particularly countries in Central Asia, South America, and Southeast Europe. The results of our study are therefore highly significant and original, providing empirical results where previously there were only frameworks for action or small-scale studies (e.g., Sindhvad et al., 2020, 2022). In doing so, our article also addresses another issue, namely, that systematic knowledge production in educational leadership is often rooted in and connected to a select few influential knowledge producers (Hallinger & Kovačević, 2019), and that existing best-evidence reviews and analyses are often products of this relationship (McGinty et al., 2022). Analyzing the freely available PISA data makes it possible to generate findings independently of this dominance. Our findings are therefore timely, as issues of effective school leadership are coming to the fore in many countries (Gümüş et al., 2018; Hallinger et al., 2020), and they highlight the need to consider IL in its specific context.
In addition, our study has implications for the interpretation and implementation of PISA and other ILSAs in policy, practice, and education research, particularly when they provide information for educational practice and research (Oh, 2020) and for evidence-based decision-making in education (Slavin, 2008). First, using different items for the same construct (e.g., school leadership) across cycles without a clear conceptual foundation makes it hard to interpret changes and to make meaningful comparisons. In addition, given the increasing research and policy attention to other leadership models, such as distributed leadership and transformational leadership (Leithwood & Sun, 2012; Spillane, 2012), as well as to integrated views (Printy et al., 2009), the school leadership questionnaires used by the OECD seem quite narrowly designed, as they are mostly based on a single leadership approach, IL.
We therefore suggest that the OECD use more concrete frameworks of various leadership models and strive for consistency in the application of questionnaires across different cycles and studies (e.g., TALIS and PISA). This is particularly important for empirically testing measurement invariance between cycles and studies, which in turn contributes to the robustness of results in IPD meta-analyses. Furthermore, moving beyond cross-sectional data sets and designing longitudinal ones is advisable for both governments and international organizations, since cross-sectional data do not always provide insights into the complex relationships between student achievement and organizational variables, which usually require a longer time to exert influence. The self-report nature of leadership scales could also be an issue because of tendencies toward self-presentation or extreme responses, as discussed in the literature (Paulhus & Vazire, 2007). This becomes even more problematic for cross-national studies, since such tendencies might also be sensitive to cultural differences (Guo & Lu, 2018; McNabb, 1990). We therefore suggest collecting data from multiple sources (e.g., principals and teachers), so that scholars can use various designs, including the self-other agreement approach (Atwater & Yammarino, 1992; Ham et al., 2015).
Supplemental Material
sj-pdf-1-epa-10.3102_01623737231197434 – Supplemental material for “Putting the Instructional Leadership–Student Achievement Relation in Context: A Meta-Analytical Big Data Study Across Cultures and Time” by Marcus Pietsch, Burak Aydin, and Sedat Gümüş, Educational Evaluation and Policy Analysis.
sj-csv-2-epa-10.3102_01623737231197434 – Supplemental material (CSV data set) for the same article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Marcus Pietsch is supported by a Heisenberg professorship of the German Research Foundation (DFG; Project ID: 451458391, PI 618/4-1).
Authors
MARCUS PIETSCH, PhD, is a DFG Heisenberg professor at the Leuphana University of Lüneburg, Germany. His research uses quantitative methods to explore innovation and change in schools and education systems, and how school leaders can contribute to creating better schools for all.
BURAK AYDIN, PhD, is a faculty member at Ege University, Türkiye, and a researcher at the Leuphana University of Lüneburg, Germany. He is a quantitative data analyst, and his research mainly focuses on the theory and application of multilevel modeling, structural equation modeling, and propensity score analyses.
SEDAT GÜMÜŞ, PhD, is an associate professor in the Department of Educational Policy and Leadership at The Education University of Hong Kong. His current research focuses on the relationship between school leadership and various teacher and student outcomes as well as the contextualization of leadership models and practices.
