Abstract
The Common Guidelines for Education Research and Development were created jointly by the Institute of Education Sciences and the National Science Foundation to streamline education research and contribute to an accumulation of knowledge that will lead to improved student outcomes. One type of research that emerged in the guidelines is impact research. In order to achieve the level of rigor expected for an impact study, it is common for a research team to employ a cluster randomized trial (CRT). This article provides empirical estimates of design parameters necessary for planning adequately powered CRTs focused on science achievement. Examples of how to use these parameters to improve the design of science impact studies are discussed.
The Guidelines identify six types of research, including (a) foundational research, (b) early-stage or exploratory research, (c) design and development research, (d) efficacy research, (e) effectiveness research, and (f) scale-up research. These six types are combined into three categories of research: research aimed at contributing to core knowledge (types [a] and [b]), research aimed at developing solutions to improve learning (type [c]), and research aimed at contributing to evidence of impact (types [d], [e], and [f]). Each category of studies is critical to knowledge generation, from basic research to the development of interventions to testing the effectiveness of these interventions for improving student outcomes. However, the focus of this article is on the third category, impact research.
According to the Guidelines, the goal of impact research is to generate “reliable estimates of the ability of a fully-developed intervention or strategy to achieve its intended outcomes” (IES & NSF, 2011, p. 9). Stated differently, the goal is to identify “what works” or what interventions improve student outcomes. The three types of impact research differ in terms of “the conditions under which the intervention is implemented and the populations to which the findings generalize” (IES & NSF, 2011, p. 9). Efficacy research focuses on testing an intervention under ideal conditions, often in one population, and may include higher-than-normal levels of support from the program developers. Effectiveness research focuses on testing an intervention under typical conditions, often in a limited number of populations, and with normal levels of support from the program developers. Scale-up research focuses on testing an intervention across a broad array of contexts, in multiple populations, and “without substantial developer involvement in implementation or evaluation” (IES & NSF, 2011, p. 9). In short, these types of impact studies differ in their intent to generalize effects across populations, contexts, and levels of authenticity in the implementation conditions. Across all three types of impact studies, the most common designs are the randomized control trial (RCT) and the quasiexperimental design (QED).
In the past decade, we have seen an increase in the number of impact studies that utilize RCTs and QEDs across the field of education. For example, since 2002, IES has funded over 175 impact studies that employ RCTs (ies.ed.gov). The majority of these studies focus on reading and mathematics interventions, with only a small percentage focused on science interventions. However, a recent systematic review by Slavin, Lake, Hanley, and Thurston (2014) provides evidence that there has been a shift in recent years toward more impact studies in science education. Slavin et al. found in their review of studies starting in 1980 that only 23 of the 332 identified studies met inclusion criteria for rigorous designs, which included strong QEDs and RCTs, and that most qualifying studies were published in 2006 or later. More specifically, one qualifying study was published in 2006, five in 2007, and 11 in 2009 or later, for a total of 17 of the 23 studies being published in 2006 or later and only six studies being published between 1980 and 2005.
The role of large-scale impact studies in science education research has generated a lot of attention in the past few years. In 2012, the
In order for these impact studies to generate high-quality and rigorous evidence of the effectiveness of the intervention, the study must have a strong research design. According to the Guidelines, this means a research plan that provides detailed discussion of the following six components: (a) the study design, (b) outcomes of interest and size of impact that is important for each outcome, (c) setting and population of interest, (d) sample and statistical power associated with the sample and design, (e) plan for data collection, and (f) plan for analyses and reporting. All six of these components are crucial to planning a rigorous impact study. However, in this article we focus on the sample and statistical power associated with impact studies of science interventions.
As noted above, the most common designs in impact studies are RCTs and QEDs. Although the statistical power considerations for QEDs are the same as those for RCTs, we frame our presentation from the perspective of designing RCTs, as this preference is consistent with the Guidelines document, which in referring to researchers indicates that “generally and when feasible, they should use designs in which treatment and comparison groups are randomly assigned” (IES & NSF, 2011, p. 21). This preference is rooted in a specific threat to the internal validity of a QED—selection bias. Because groups in a QED are not formed by random assignment, they may differ systematically at baseline in ways that confound the estimate of the treatment effect.
Statistical Power for CRTs
The nested structure of the data in a CRT necessarily makes the a priori power analysis more complex. The power for a two-level CRT with randomization at the school level depends on the following: the alpha level, the number of students per school, the total number of schools, the effect size, the intraclass correlation (ICC), and the percentage of variance explained by the covariate(s).
Conducting an a priori power analysis is critical because it allows the researcher to determine the necessary sample size to detect an effect of a particular magnitude. If there are too few schools, an effective intervention may be deemed as having no effect when in reality the study was simply underpowered to detect the effect. If there are too many schools, the study may be overpowered, which means more money was spent than necessary. Because impact studies with schools as the unit of assignment tend to be large and very costly, from a financial perspective, it is critical to use accurate estimates of design parameters for planning science education CRTs.
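To make these dependencies concrete, the standard closed-form approximation for the power of a two-level CRT, in terms of the same quantities listed above, can be sketched in a few lines. The function below is a hypothetical helper, not code from this study; it uses a normal approximation rather than the t-based calculation in dedicated tools such as Optimal Design, so its answers will differ slightly from those tools.

```python
from statistics import NormalDist

def crt_power(effect_size, J, n, icc, r2_level2=0.0, r2_level1=0.0,
              alpha=0.05, p_treat=0.5):
    """Approximate power for a two-level CRT with schools randomized.

    The variance of the estimated standardized treatment effect is
    rho(1 - R2^2)/(P(1-P)J) + (1 - rho)(1 - R1^2)/(P(1-P)Jn),
    where rho is the ICC, J the number of schools, n the students
    per school, and P the proportion of schools treated.
    """
    var = (icc * (1 - r2_level2) / (p_treat * (1 - p_treat) * J)
           + (1 - icc) * (1 - r2_level1) / (p_treat * (1 - p_treat) * J * n))
    z = NormalDist()
    lam = effect_size / var ** 0.5       # approximate noncentrality
    crit = z.inv_cdf(1 - alpha / 2)      # two-tailed critical value
    return z.cdf(lam - crit)             # P(reject H0 | true effect)

# Adding schools raises power far more than adding students per school:
p_base = crt_power(0.20, J=40, n=125, icc=0.20, r2_level2=0.80)
p_more_schools = crt_power(0.20, J=80, n=125, icc=0.20, r2_level2=0.80)
p_more_students = crt_power(0.20, J=40, n=250, icc=0.20, r2_level2=0.80)
```

Note that the number of schools J appears in the denominator of both variance terms, whereas the number of students per school n appears only in the second; this is why, for a fixed ICC, recruiting more schools buys much more power than recruiting more students within each school.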
Design Parameters for Science CRTs
ICCs and R² values (the percentage of variance explained by covariates) are the key design parameters in the power analysis for a CRT: the ICC describes how the variance in the outcome is distributed across levels, and the R² values describe how much of that variance the covariates can remove. Accurate empirical estimates of both are therefore essential for credible a priori power analyses.
However, the challenge for teams designing impact studies focused on science outcomes is that the majority of the studies providing empirical estimates are focused on mathematics and reading outcomes. This leaves science education researchers with few options but to borrow ICC and R² estimates based on mathematics and reading outcomes. Such borrowing is problematic for several reasons. First, it is not clear that the variance structure of science achievement mirrors that of mathematics or reading, so parameters estimated in those subjects may not transfer to science.
Second, unlike reading and mathematics, science is not tested annually in most states. Hence the set of covariates available for reading and mathematics outcomes is not necessarily the same for science outcomes (Westine et al., 2013). For example, for reading and mathematics outcomes, the covariate set that yields the greatest explanatory power is the student-level pretest in the same subject (Bloom et al., 2007) lagged just 1 year before the posttest. Given that in science, students are not typically tested in consecutive years, a science student-level pretest lagged just 1 year is not an option, and the explanatory power of other covariate sets must be explored.
Finally, it is uncertain to what extent science design parameters vary depending on location. Although recently a small number of empirical investigations of design parameters for science outcomes have emerged (Westine et al., 2013; Xu & Nichols, 2010; Zhu et al., 2012), these studies have not considered the variability across location (e.g., across state boundaries). The findings we present in this article represent the largest cross-state compilation and comparison of ICCs and R² values for science achievement outcomes to date.
This comparative work functions as a critical next step toward building a comprehensive resource of empirical estimates of design parameters for science education outcomes that we anticipate will help improve the quality of power analyses for CRTs of science education impact studies so that these studies have the capacity to provide rigorous evidence of impact. Specifically, the purpose of this article is to provide and compare empirical estimates of the following:
1. Intraclass correlations for science outcomes from three states for (a) elementary, middle, and high school grades and (b) two- and three-level nested (hierarchical) designs.
2. Percentage of variance explained in science outcomes for (a) individual-level covariates and (b) cluster-level covariates.
The remainder of this article is organized as follows. First we describe the data, models, and analyses. Then, we present the results, followed by several applications of the empirically estimated design parameters. We conclude with a discussion and suggestions for future research.
Data
The data for this study come from three states: Texas, Wisconsin, and Michigan. In each state, we obtained data for multiple academic years: 5 years for Texas starting in the 2006-2007 academic year, 6 years for Wisconsin starting in the 2005-2006 academic year, and 4 years for Michigan starting in the 2007-2008 academic year. In each state, the data included student-level achievement data from the state test for science, mathematics, and reading. The state tests include the Texas Assessment of Knowledge and Skills (TAKS), the Wisconsin Knowledge and Concepts Examination (WKCE), and the Michigan Educational Assessment Program (MEAP) for Grades 3 through 8 and Michigan Merit Exam (MME) for Grades 9 through 12. Students were linked to schools and districts within each state. The grade in which a subject was tested varies across states and is displayed in Table 1. Note that as expected, across the three states, science is tested much less frequently than mathematics and reading.
Grades in Which Assessments Are Conducted, by Subject and State
We prepared each state data set following a similar protocol in order to obtain a consistent set of usable data across years and states. In each state, and for all years of data, we generated indicator variables from a set of demographic variables, including gender, ethnicity, free or reduced-price lunch as a proxy for socioeconomic status (SES), and limited English proficiency (LEP) status. Collectively, the indicator variables for gender, ethnicity, SES, and LEP make up the set of demographic covariates. Students with nonvalid or duplicate student identification numbers, as well as students requiring a testing accommodation (standard or nonstandard) or that were marked as unethical, were removed from the sample. Students with disabilities or identified as receiving special education were also removed. The percentage of data removed for each state during this cleaning process is provided in Table 2. Note that in Texas, the sample size was also reduced by the masking process. In schools with fewer than five individuals in any single demographic group, achievement scores were masked for all students in that group.
Counts of Students, Schools, and Districts for Science Achievement Unconditional Models, by Grade, State, and Year
Models
We empirically estimate ICCs and R² values using two hierarchical linear models: a two-level model with students nested within schools and a three-level model with students nested within schools and schools nested within districts.
The design parameter estimates from the two models serve to inform different designs. Estimates from the two-level model directly inform the design of a schoolwide intervention study where students are nested within schools and whole schools are randomly assigned to treatment conditions. Estimates from the three-level model directly inform the design of a districtwide intervention study where students are nested within schools that are nested within districts and whole districts are randomly assigned.
We first present the unconditional two- and three-level models, which do not include covariates, to show the ICC calculations. This is followed by an example of the two-level conditional models, which include covariates, to show the calculations for the R² values.
Two-Level Unconditional Model
For each grade and data set, we begin with an unconditional model. Following the Raudenbush and Bryk (2002) notation, the Level 1 model is

$$Y_{ij} = \beta_{0j} + r_{ij}, \quad r_{ij} \sim N(0, \sigma^2),$$

for student $i = 1, \ldots, n_j$ in school $j = 1, \ldots, J$, where $Y_{ij}$ is the science achievement score, $\beta_{0j}$ is the mean for school $j$, and $r_{ij}$ is the student-level error. The Level 2, or school-level, model is

$$\beta_{0j} = \gamma_{00} + u_{0j}, \quad u_{0j} \sim N(0, \tau^2),$$

where $\gamma_{00}$ is the grand mean and $u_{0j}$ is the school-level random effect.

The ICC ($\rho$) and corresponding standard error for a large, balanced (i.e., an equal number of students per school) sample are

$$\rho = \frac{\tau^2}{\tau^2 + \sigma^2}, \qquad SE(\hat{\rho}) = (1 - \rho)\left[1 + (n - 1)\rho\right]\sqrt{\frac{2}{n(n - 1)(J - 1)}},$$

where $n$ is the number of students per school and $J$ is the number of schools.
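As a numerical check on the ICC definition and its large-sample standard error for a balanced two-level design, the calculation can be sketched directly from the two variance components; this is an illustrative helper with invented input values, not code or estimates from the study.

```python
import math

def icc_two_level(tau2, sigma2, n, J):
    """ICC and its large-sample standard error for a balanced
    two-level design.

    tau2   -- between-school variance component
    sigma2 -- within-school (student-level) variance component
    n      -- students per school (balanced design)
    J      -- number of schools
    """
    rho = tau2 / (tau2 + sigma2)
    # Common large-sample approximation for a balanced design
    se = (1 - rho) * (1 + (n - 1) * rho) * math.sqrt(
        2.0 / (n * (n - 1) * (J - 1)))
    return rho, se

# Hypothetical components on a standardized scale (they sum to 1):
rho, se = icc_two_level(tau2=0.25, sigma2=0.75, n=125, J=100)
```

With standardized test scores, the two components sum to roughly 1, so the ICC can be read as the share of total variance lying between schools.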
Three-Level Unconditional Model
The three-level model is a natural extension of the two-level model; hence, we omit the presentation of each level individually and present only the mixed model. The unconditional mixed model is

$$Y_{ijk} = \gamma_{000} + u_{00k} + r_{0jk} + e_{ijk},$$

for student $i$ in school $j$ in district $k$, with student-, school-, and district-level variance components $\sigma^2$, $\tau^2_S$, and $\tau^2_D$, respectively.

Note that in a three-level model, there are two ICCs because there are three variance components. Following Hedges et al. (2012), the Level 2 and Level 3 ICCs—assuming a large, balanced (i.e., an equal number of students per school) sample—are

$$\rho_S = \frac{\tau^2_S}{\tau^2_D + \tau^2_S + \sigma^2}, \qquad \rho_D = \frac{\tau^2_D}{\tau^2_D + \tau^2_S + \sigma^2},$$

with corresponding large-sample standard errors as given by Hedges et al. (2012).
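The two ICCs follow mechanically from the three variance components; the sketch below uses the approximate Texas Grade 5 decomposition described later in the Results (about 8% district, 12% school, 80% student) purely as illustrative input.

```python
def iccs_three_level(tau2_district, tau2_school, sigma2):
    """School- and district-level ICCs from the three variance
    components of an unconditional three-level model
    (students in schools in districts)."""
    total = tau2_district + tau2_school + sigma2
    rho_school = tau2_school / total      # Level 2 (school) ICC
    rho_district = tau2_district / total  # Level 3 (district) ICC
    return rho_school, rho_district

# Approximate Texas Grade 5 pattern: 8% district, 12% school, 80% student
rho_s, rho_d = iccs_three_level(tau2_district=0.08, tau2_school=0.12,
                                sigma2=0.80)
```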
Two-Level Conditional Models
In order to explain the variance in the outcome and obtain a more precise estimate of the treatment effect, we include covariates. Depending on the availability of data, covariates can occur at Level 1 or Level 2. In this study we consider covariates at both Level 1 and Level 2. We do not center the covariates in our models as the choice of centering will not impact the estimates of the variance components in the models we consider. However, it is important to note that in analyses that seek to estimate the effect of an intervention, centering does impact the coefficients and interpretation of these estimates. In cases when we use a student-level covariate, the aggregated covariate at the school level (school mean) is also included. The Level 1 covariates we consider include student-level pretests and student-level demographics, such as gender or free/reduced-price lunch status. We also explored cases with no Level 1 covariate and only a Level 2 covariate, such as school mean scores from the previous year, as a previous year’s school mean is a common covariate used in the design of CRTs (Bloom, Richburg-Hayes, & Rebeck-Black, 2007).
For illustration purposes, we provide the models that include both student-level covariates at Level 1 and the corresponding aggregate school-level covariates at Level 2. The models easily extend to cases with covariates at only higher levels and to three-level models. Building from Equation (1), the Level 1 model with student-level covariates can be written as

$$Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + r_{ij}, \quad r_{ij} \sim N(0, \sigma^2_{|X}),$$

for student $i = 1, \ldots, n_j$ in school $j = 1, \ldots, J$, where $X_{ij}$ is the student-level covariate (or vector of covariates).

The new Level 2 model, or cluster-level model, is

$$\beta_{0j} = \gamma_{00} + \gamma_{01} \bar{X}_{\cdot j} + u_{0j}, \quad u_{0j} \sim N(0, \tau^2_{|X}); \qquad \beta_{1j} = \gamma_{10},$$

where $\gamma_{00}$ is the adjusted grand mean, $\bar{X}_{\cdot j}$ is the school mean of the covariate, and $\sigma^2_{|X}$ and $\tau^2_{|X}$ are the conditional (residual) variance components.

We calculate the proportion of variance explained at Level 1 ($R^2_1$) and at Level 2 ($R^2_2$) by comparing the unconditional and conditional variance components:

$$R^2_1 = \frac{\sigma^2 - \sigma^2_{|X}}{\sigma^2}, \qquad R^2_2 = \frac{\tau^2 - \tau^2_{|X}}{\tau^2}.$$

Note that if a covariate is included only at Level 2, for example, last year’s school mean, there is no $R^2_1$; only $R^2_2$ is estimated.
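The proportion-of-variance-explained calculation reduces to comparing variance components from the unconditional and conditional fits; the helper and input values below are illustrative, not estimates from the study.

```python
def variance_explained(sigma2_u, tau2_u, sigma2_c, tau2_c):
    """Proportion of variance explained at each level, comparing
    unconditional (_u) and conditional (_c) variance components."""
    r2_level1 = (sigma2_u - sigma2_c) / sigma2_u  # student level
    r2_level2 = (tau2_u - tau2_c) / tau2_u        # school level
    return r2_level1, r2_level2

# A covariate set that shrinks the between-school component from
# 0.20 to 0.03 explains 85% of the Level 2 variance:
r2_1, r2_2 = variance_explained(sigma2_u=0.80, tau2_u=0.20,
                                sigma2_c=0.60, tau2_c=0.03)
```

Because it is the Level 2 (between-school) variance that dominates the power of a school-randomized CRT, a covariate with a large Level 2 R² is far more valuable for design purposes than one that explains only student-level variance.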
Analyses
As noted in Table 1, each state has its own pattern for testing science. Wisconsin is the only state in our data set that tested in Grade 4. For all other grades, at least two states are represented, and in Grade 8 all three states are represented. We estimate ICCs for the two-level and three-level models for each data set in each grade. Within each state and grade, we take a simple average of the ICC estimates and standard errors across the years available in the state database.
The R² analyses depend on the covariate sets available in each state, grade, and outcome year. Table 3 identifies the states and outcome years in which each covariate model for science achievement is estimated.
States in Which Covariate Models for Science Achievement Are Estimated by Grade
Available for the outcome year 2010-2011.
Available for the two outcome years 2009-2010 and 2010-2011.
Available for the three outcome years 2008-2009 to 2010-2011.
Available for the four outcome years 2007-2008 to 2010-2011.
Available for the five outcome years 2006-2007 to 2010-2011.
Available for the six outcome years 2005-2006 to 2010-2011.
Similar to the ICC analyses, we present a simple average of the R² estimates across the outcome years available within each state and grade.
Results
We begin with the ICCs for the two- and three-level unconditional models, followed by the R² values for the conditional models.
Intraclass Correlations
Table 4 reports the ICCs for the two-level models for science outcomes. The ICC ranges from a low of 0.172 (Texas Grade 8) to a high of 0.312 (Michigan Grade 11). In other words, in Texas Grade 8, 17% of the variance in science test scores is between schools, whereas it is much higher in Michigan Grade 11, at 31%. In general, the ICC is lower in Texas than in Michigan and Wisconsin for grades in which the three states tested science. Another way of stating this is that there is less variability in science achievement between schools in Texas. We can also look for trends across the grades. In Michigan and Wisconsin, there appears to be more variance between schools (higher ICC estimates) for middle and high school grades than for the elementary grades. However, for Texas, the ICCs were similar across all grades.
Unconditional School-Level ICC Averages for Science Achievement Outcomes by Grade and State: Two-Level Model
Unconditional ICCs for Michigan are averages across 4 years of data (2007-2008 through 2010-2011).
Unconditional ICCs for Texas are averages across 5 years of data (2006-2007 through 2010-2011).
Unconditional ICCs for Wisconsin are averages across 6 years of data (2005-2006 through 2010-2011).
The unconditional ICCs for the three-level models, students nested in schools nested in districts, are reported in Table 5. The pattern in the ICCs for Texas and Wisconsin is similar. In both states, the school-level variance is always larger than the district-level variance, ranging from almost twice as large to nearly 3 times as large. For example, in Texas in Grade 5, the results suggest that approximately 8% of the variance in science test scores is between districts, 12% is between schools within districts, and the remaining 80% is between students within schools. That is, there is more variability in science achievement between schools within a district than between districts. However, the pattern differs in Michigan. In Grade 5, the ICC at the district level is nearly twice as large as that at the school level, suggesting there is more variability between districts in Michigan than within districts. In Grade 8, the school-level variance and the district-level variance are very similar. And at Grade 11, similar to the other states, there is more variance at the school level than at the district level. However, the magnitude of the difference is much larger than in the other states.
Unconditional School-Level and District-Level ICC Averages for Science Achievement Outcomes by Grade and State: Three-Level Model
Unconditional ICCs for Michigan are averages across 4 years of data (2007-2008 through 2010-2011).
Unconditional ICCs for Texas are averages across 5 years of data (2006-2007 through 2010-2011).
Unconditional ICCs for Wisconsin are averages across 6 years of data (2005-2006 through 2010-2011).
Percentage of Variance Explained
As identified in Table 3, we examined the strength of the following three sets of covariates: demographics, school-level pretests, and student-level pretests. We present the results in this order since demographics are available for all grades and data sets, school-level pretests are available for all grades, and student-level pretests are less available across the grades and states. The results for the two-level and three-level models are presented in this section.
Table 6 presents the R² values for the demographic covariate set for the two- and three-level models, by grade and state.
Average R² Values for the Demographic Covariate Set for Science Achievement Outcomes by Grade and State
The two-level model refers to a conditional HLM with students nested in schools. Student demographic covariate variable set is included at Level 1 and aggregated at Level 2.
The three-level model refers to a conditional HLM with students nested in schools nested in districts. Student demographic covariate variable set is included at Level 1 and aggregated at Level 2 and Level 3.
We present the results for the school-level pretests for science, reading, and mathematics in Table 7. Note that because these are school-level covariates, variation at the student level cannot be explained. We begin by comparing the results by subject of the pretest for the two-level model. For example, in Grade 5 in Michigan, the percentage of variance explained by the school-level science pretest is higher than for the reading or mathematics pretest. In fact, in every grade within each of the three states, the science pretest explains more variability than either the reading or mathematics pretest. This holds in the three-level model as well, where the explanatory power of the school-level science pretest exceeds any other subject pretest at both the school and district levels. These findings are similar to recent work that found that same-subject pretests are more powerful than pretests from another subject (Bloom et al., 2007). There also appear to be some trends by state. In each within-grade comparison, the explanatory power of the school-level pretest in Texas, regardless of subject, is less than that of Michigan or Wisconsin, although the magnitude of the difference is small in some cases (i.e., Grade 8 in Michigan and Texas).
Average R² Values for School Mean Pretest Covariates for Science Achievement Outcomes by Grade and State
The two-level model refers to a conditional HLM with students nested in schools. School mean pretest covariates are included at Level 2.
The three-level model refers to a conditional HLM with students nested in schools nested in districts. School mean pretest covariates are included at Level 2 and aggregated at Level 3.
The results for the most recent student-level pretest from each of the three subjects are in Table 8. It is important to keep in mind that the lag on the pretest may be 1, 2, 3, or even 4 years depending upon the specific state, grade, and pretest of interest. Thus, we cannot do a simple comparison within each state or within a grade, as we did in previous tables. However, there are several comparisons to consider. First, we can examine the R² values for the most recent student-level science pretest relative to those for the reading and mathematics pretests.
Most recent student-level science pretest covariate for science achievement outcome in Grades 8 and 11 for Michigan is a 3-year lag.
Most recent student-level science pretest covariate for science achievement outcome in Grade 8 for Texas is a 3-year lag.
Most recent student-level science pretest covariate for science achievement outcomes in Grade 8 for Wisconsin is a 4-year lag.
The two-level model refers to a conditional HLM with students nested in schools. Student-level pretest covariate is included at Level 1 and aggregated at Level 2.
The three-level model refers to a conditional HLM with students nested in schools nested in districts. Student-level pretest covariate is included at Level 1 and aggregated at Level 2 and Level 3.
It is also interesting to examine the explanatory power in the case when the time lag for a science pretest is greater than the time lag for a reading or mathematics pretest. For example, in Grade 8 in Michigan and Texas, the first available time lag for a science pretest is 3 years. However, the mathematics and reading pretests for Grade 8 are available at 1 year. In the majority of the cases, the 1-year lag mathematics or reading pretest options explained more variation at the student, school, or district level than the 3-year lag science pretest. Given that in many cases the 1-year lag from a different subject had more explanatory power and may be more accessible than the 3-year lag of the same subject, it seems reasonable to consider the shorter lag for a different subject pretest as a viable covariate to increase the power of a study.
Looking Across Covariate Sets
We presented R² values for three covariate sets: demographics, school-level pretests, and student-level pretests.
In addition to the empirical estimates of the R² values, researchers should weigh the cost and feasibility of collecting each covariate set: school-level pretests are typically available from public records, whereas student-level pretests and demographics require data agreements and additional resources.
Applications
The design parameters in Tables 4 through 8 can be used in planning science impact studies. Suppose that a team of science researchers is designing a study to test the efficacy of a new science curriculum for fifth graders in the state of Michigan. They plan to randomly assign entire schools to either the new curriculum or the current curriculum. There are approximately 125 fifth graders per school. Given the budgetary constraints of the project, the team is limited to 40 schools and plans to randomly assign half to the treatment and half to the comparison group. The team has a fixed number of schools, which is common in practice, given access to schools and budgetary constraints (Taylor et al., 2013). Hence the goal of the power analysis is to determine the smallest effect that has an 80% chance of being found statistically significant, also known as the minimum detectable effect size (MDES; Bloom, 1995). Then the team can evaluate whether the MDES is reasonable, given the intervention being tested.
Determining whether the MDES is appropriate is not an easy task. Based on a review of 76 meta-analyses of studies of educational interventions, Hill, Bloom, Rebeck-Black, and Lipsey (2008) found mean treatment effect sizes in the 0.20-to-0.30 range. For science interventions, the Slavin et al. (2014) review of elementary science programs found average treatment effects ranging from 0.03 to 0.42. Although larger effect sizes are appealing to researchers because they require fewer clusters, it is important to think through the ramifications of designing a study with too large of an MDES. For example, suppose that a study is designed to detect an MDES of 0.40. This means that the study has adequate power to detect an effect of 0.40 or higher, but it is not powered to detect an effect smaller than 0.40. Hence if the true effect is in the range noted by Hill et al., the study would be underpowered and the researchers may incorrectly conclude that the treatment had no effect.
For the power analysis, assume that the researchers use Optimal Design Plus (OD Plus; Raudenbush et al., 2011) to calculate the MDES. Estimates of the ICC and R² are needed for the calculation; because the proposed study matches the population (Michigan fifth graders) and design (a two-level CRT with schools randomly assigned) underlying our estimates, the team can draw the ICC and R² values directly from the tables.
In the first application, the study location and the study design matched exactly to the information provided by our analyses. Naturally, this will not always be the case. In other scenarios, the desired design may be different. For example, the classroom level may be included in the study, making the levels of nesting different, or schools may be blocked by district, resulting in a multisite cluster randomized trial (MSCRT) with districts as sites. Similarly, a different state or specific subset of schools within a specific state may be included in the study. The study may also involve a different grade that our current estimates do not cover. Although our results will not directly match the needs of every impact study design, we argue they can still provide useful information for evaluating estimates of design parameters. Without estimates that align exactly with the proposed study, it is incumbent upon the researcher to cautiously utilize any available information to craft the best argument possible for selecting design parameter estimates.
For example, suppose that the new fifth-grade science curriculum also had an eighth-grade curriculum. The research team wants to test the efficacy of the eighth-grade curriculum and plans to conduct a two-level CRT with students nested within schools. The budget allows for a total of 30 schools, 15 in the treatment condition and 15 continuing with the current science curriculum, and 300 students per school. The team plans to recruit schools from the state of Ohio. The researchers want to know the MDES, given the current size of the study.
In this case, the estimates of the design parameters in Tables 4 through 8 do not directly match since the estimates for Ohio are not included in the tables. However, the design parameters can still be used to guide the estimates. One option is to examine the student population in Ohio and the configuration of the schools and districts to try to determine if Ohio is more similar to any of the three states included in our analyses. This might include an examination of things like average district size, prior achievement, percentage free/reduced-price lunch, percentage minority, and so on. In many cases, these data are available through state-specific websites or the Common Core of Data (nces.ed.gov/ccd). If Ohio seems similar to any of the three states in the tables, the research team could make a case to use empirical estimates from that state. However, caution is advised, as without further examples of estimates from other states, it is unknown which factors are the most essential to consider when comparing states. An additional option is to use a point estimate (e.g., the mean, weighted mean, median, or lower/upper bound of the estimated confidence interval) across states of the design parameters presented in this article. In this scenario, it is important to consider the representativeness of these three states, as it is unknown, for example, whether any of the three states constitutes an extreme case. For illustrative purposes, we will select the (simple) mean ICC value, 0.230, for our example power calculation.
The choice of covariate sets also needs to be considered. Looking across Tables 6 through 8 at Grade 8, it appears that the percentage of variance explained by the student-level pretests is very similar to that of the school-level science pretests. Given the additional time and cost to collect student-level pretests, we proceed with the previous year’s school-level science scores (Table 7). Similar to the ICC, we might choose to justify a point estimate for the percentage of variance explained by the school-level pretest. Across the three states, the mean value is 0.849. Combining the mean ICC with the mean percentage of variance explained by the school-level pretest, we can calculate the MDES. In this case, the MDES estimate is 0.205.
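As a rough cross-check of this calculation (a sketch, not OD Plus output), the standard MDES formula for a two-level CRT with a single school-level covariate can be coded directly. SciPy is assumed for the t quantiles, and the degrees of freedom J − 2 − 1 reflect the one cluster-level covariate.

```python
from scipy.stats import t

def mdes_two_level(J, n, icc, r2_level2, alpha=0.05, power=0.80,
                   p_treat=0.5, n_cluster_covs=1):
    """MDES for a two-level CRT with schools randomized and a
    school-level covariate (no student-level covariate)."""
    df = J - 2 - n_cluster_covs
    m = t.ppf(1 - alpha / 2, df) + t.ppf(power, df)  # MDES multiplier
    var = (icc * (1 - r2_level2) / (p_treat * (1 - p_treat) * J)
           + (1 - icc) / (p_treat * (1 - p_treat) * J * n))
    return m * var ** 0.5

# 30 schools, 300 students each, mean ICC = 0.230, school-level
# pretest R^2 = 0.849 -> MDES of about 0.205, as in the text.
mdes = mdes_two_level(J=30, n=300, icc=0.230, r2_level2=0.849)
```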
Discussion
Across education, there is an emphasis on rigorous research to test the impact of programs and practices. Some argue that science education lags behind other fields in responding to this call (e.g., Minner, Levy, & Century, 2010; Slavin et al., 2014). However, given the recent release of the Guidelines and the federal emphasis on impact studies for determining the effectiveness of educational interventions, we expect that we will continue to see more science education impact studies. It is critical that all impact studies be well designed and adequately powered to yield high-quality evidence of program effectiveness. In order to design adequately powered impact studies of science interventions, teams planning these studies must estimate design parameters to use in the power analyses. This article seeks to improve the accuracy of the power calculations by providing empirical estimates of design parameters for impact studies with science education outcomes. In this article, we focused on the decomposition of the variance in science test scores between districts and schools and the explanatory power for different covariate sets for planning two- and three-level CRTs.
In terms of the variance in science test scores between schools and districts, the empirical estimates suggest these estimates vary across states. On average, the estimates of the between-school and between-district variance from Texas were lower than both Wisconsin and Michigan. The variance between schools and between districts also appeared to vary across grades, with higher grades tending to have higher values. However, this pattern was not consistent across all states, and adding more states to the database would be helpful in testing this pattern.
We examined the empirical estimates of the percentage of variance explained along three dimensions: demographics, school-level pretests, and student-level pretests. In general, the explanatory power of the 1-year lag school-level science pretest was the highest. Given that school-level pretests are much less expensive and easier to obtain than student-level demographics or pretests, we suggest use of the 1-year lag school-level science scores.
The empirical estimates of design parameters provided in this article are meant to serve as a resource and a guide to designing impact studies. In any particular study, the true values of the design parameters may be different than the estimated values. For example, the estimates for each state take into account all schools in the state. However, if the schools were selected from one large district within a state, the schools within that district may be more homogeneous than the schools across the entire state. Hence, characteristics of the sample of schools and the design itself should be considered when selecting the most appropriate design parameters.
The extant empirical work on design parameters for science education impact studies is minimal compared to that focused on reading and mathematics education outcomes. Yet impact studies focused on science education outcomes are held to the same standards as those focused on reading and mathematics education outcomes. If rigorous evidence of the impact of science education interventions is to accumulate, and accumulate quickly, as suggested in the Guidelines document, science education researchers should seek to design studies that can detect substantively important effects. Hence, building a resource of design parameters specific to science education is a critical step toward moving forward the agenda to improve the rigor of impact studies in the field.
Future Directions
This article represents the largest set of design parameters available to date for teams planning impact studies focused on improving science outcomes. We see this as the beginning of a resource for science education researchers planning such studies. However, there are several ways this resource can continue to grow.
One strategy is to encourage research teams to report the variance components and percentage of variance explained from covariate sets. Coupled with details on the sample of schools in the study, this may provide a useful resource for a team planning a similar study with a similar set of schools.
A second strategy is to update the design parameter estimates with additional states and databases. Adding states and other databases may help elucidate the between-state patterns and clarify the potential trend toward more between-school variance in the upper grades.
A third extension is to consider additional designs. In science education, it is becoming more common to design studies that involve the teacher level. This includes three-level CRTs with students nested within teachers nested within schools and treatment assigned at the school level, as well as multisite cluster randomized trials (MSCRTs) with students nested within teachers and teachers assigned to treatment within schools (Heller et al., 2012). Hence, estimating design parameters that include the teacher level is important.
With the emergence of three-level CRTs that include a teacher level, estimating design parameters for studies with teacher outcomes, such as knowledge or practice, becomes a fourth possible research focus in its own right. Recently, Kelcey and Phelps (2013) estimated design parameters for teacher outcomes related to mathematics and reading content knowledge and practice and found they differed from those for student outcomes. In fact, there was more clustering for teacher-level outcomes than for student-level outcomes. However, to date, there are no studies reporting design parameters for teacher-level outcomes in science.
Finally, the literature base would be enriched by future studies that explore why clustering effects can be quite strong and what variables might cause them to differ. For example, it is somewhat intuitive to posit that between-state differences in the percentage of variance at the school and district levels could be related to the fact that “neighborhood effects” are stronger in some states than others—more specifically, that student or school demographics are clustered differently across states (e.g., as a result of the state or local political, cultural, or economic conditions) and those demographic variables are strongly associated with outcomes. However, this explanation is not consistent with our observation that there is more variance between schools on science outcomes at high school than at elementary school. If the neighborhood effect were the only factor at play, then we might expect greater between-school variance for elementary schools than for high schools, as elementary schools tend to draw from a small number of adjacent neighborhoods and high schools tend to draw from a larger number of scattered neighborhoods with presumably more demographic diversity. This was not the case in this study. Perhaps there is a factor related to the high school curriculum or culture that results in students being more similar in achievement to their classmates than to other high school students, and this factor is less pronounced in elementary schools. Furthermore, it seems possible that differences in the psychometric properties associated with the various state tests (across or even within states) could account for some of the variability in design parameters that we find across populations and grades. In summary, we suspect that the design of CRTs would benefit greatly from a set of rigorous explorations of these interrelated phenomena.
Funding
This work was funded by the National Science Foundation, Award 1118555.
Authors
JESSACA SPYBROOK is an associate professor in the Department of Educational Leadership, Research, and Technology at Western Michigan University. Her research focuses on improving the quality of the designs and power analyses of cluster randomized trials in education.
CARL D. WESTINE is an assistant professor in the College of Education at the University of West Georgia. His research focuses on improving the efficiency of evaluations, particularly in education.
JOSEPH A. TAYLOR is a principal associate/scientist at Abt Associates. His background is in science education research with an emphasis on meta-analysis, statistical reporting practices, and the design of causal effect studies.
