Abstract
This paper uses a critical quantitative lens to examine the discursive techniques used in states’ approved Every Student Succeeds Act plans to rationalize their subgroup accountability decisions. These subjective decisions about subgroup composition and n-size are used to conduct quantitative analyses to identify performance gaps and, therefore, shape how particular subgroups are constructed as (dis)advantaged. We identify the plans’ discursive legitimation strategies to draw attention to the range of subjective considerations that live behind those two seemingly “objective” decisions. For decisions about subgroup composition, state educational agencies (SEAs) relied heavily on appeals to past practice and student demographics and emphasized their efforts to achieve inclusive and stable accountability measures that effectively identified performance gaps. For n-size decisions, SEAs referenced past practice, statistical expertise, the practices of other SEAs, stakeholder consultation, inclusivity, and their intention to achieve an accountability measure that provides statistical soundness, student privacy, and responsiveness to school demographics.
Introduction
In recent years, scholars have increasingly highlighted the need for critical approaches to quantitative analysis, especially in how racial and ethnic groups are constructed as part of administrative panel data like the U.S. Census (Prewitt, 2018). Decisions about how to collect and analyze administrative data have important implications for policy and research, including how congressional districts are drawn and how resources are distributed (e.g., Kenny et al., 2021). Scholars in the field of education have taken up this call by using Quantitative Critical Race Theory (QuantCrit; Castillo & Gillborn, 2023; Gillborn et al., 2018) to examine the groups researchers have used over time in educational research (Viano et al., 2024) and to provide recommendations for improving transparency in administrative data collection and use (Viano & Baker, 2020). In K–12 schooling, one key set of decisions about group construction and related quantitative analyses that remains underexplored through a critical lens, however, is the primary federal lever for educational equity: subgroup accountability.
Subgroup accountability was first introduced with the reauthorization of the Elementary and Secondary Education Act (ESEA) as No Child Left Behind (NCLB) in 2001 to focus schools on performance gaps between student groups, with the idea that groups whose performance is not included in schools’ accountability ratings may be overlooked and underserved (Rentner et al., 2003). Under prior state policies, schools seeking a favorable accountability rating were found to focus their attention on students just below the proficiency cut-off, deprioritizing lower-performing students (Booher-Jennings, 2005). Schools also referred students who scored below “proficient” on state tests to special education at an increased rate as a way of manipulating pass rates, since students with disabilities were typically excluded from taking state standardized tests (Heilig & Darling-Hammond, 2008). Subgroup accountability was created as a federal policy mechanism to address performance gaps, to ensure that a school’s overall accountability status was not achieved at the expense of some groups of students, and to prevent the unintended consequences described above. Under NCLB, student performance had to be disaggregated by major racial and ethnic groups, as well as for students classified as low-income, as having a disability, or as learning English. When NCLB was reauthorized under the Every Student Succeeds Act (ESSA) in December 2015, the legislation retained subgroup accountability as a key policy element, even as it provided states with increased flexibility over many aspects of their accountability systems.
States make two major decisions about subgroup accountability: the first concerns “subgroup composition,” or the demographic groups for which states commit to disaggregate scores. The second concerns the minimum number of students in a school from a particular group that is needed for that group to be considered large enough to “count” for accountability purposes—also known as “n-size.” N-size determines whether or not a group’s average school-level achievement will be included in accountability calculations. When the number of students in a particular group within a school does not meet the designated n-size, the group’s performance is not included in calculating the performance gaps that matter for whether or not the school is considered in need of improvement.
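To make the mechanics concrete, the sketch below illustrates the n-size rule in a few lines of Python; the threshold and the school’s subgroup counts are hypothetical, and actual state accountability formulas involve additional rules.

```python
# Illustrative sketch of the n-size rule; the threshold and the
# school's subgroup counts below are hypothetical.
N_SIZE = 20  # the state's chosen minimum subgroup size

# Number of tested students in one school, by subgroup.
school_counts = {"Black": 34, "Hispanic": 18, "White": 210, "EL": 12}

# A subgroup "counts" for accountability only if it meets the n-size;
# smaller groups drop out of the school's gap calculations entirely.
accountable = {g: n for g, n in school_counts.items() if n >= N_SIZE}
excluded = {g: n for g, n in school_counts.items() if n < N_SIZE}

print(accountable)  # {'Black': 34, 'White': 210}
print(excluded)     # {'Hispanic': 18, 'EL': 12}
```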
Both subgroup composition and n-size decisions have been the subject of controversy. For example, civil rights groups have argued that approaches to subgroup composition that combine groups of students, sometimes called “super subgroups,” will mask particular groups’ performance (Alliance for Excellent Education, 2016). At stake in these decisions is, on the one hand, which student subgroups schools are held accountable for and, on the other, whose performance goes undocumented and therefore cannot be addressed through state accountability mechanisms. In its ESSA plan guidance, the U.S. Department of Education (ED) required states to provide a rationale for their n-size decisions and for the inclusion of additional subgroups not specified in the guidance. The student subgroups that should be named and counted for accountability purposes are not self-evident, but rather negotiated and determined at the state level.
A QuantCrit lens highlights how states’ subgroup composition and n-size decision-making exposes the socially constructed nature of subgroup accountability and questions the supposed neutrality of this quantitative measure. QuantCrit’s focus on explicating the assumptions built into administrative data collection and analysis calls for renewed examination into how racial groups are constructed and how performance gaps are calculated for K–12 accountability. With this aim, language-based methodologies that treat policies as texts, which have greatly contributed to understandings of policy formation (Lester et al., 2017), can help unpack the justifications behind quantitative policy measures. Therefore, we examine the discursive techniques states use to rationalize their subgroup accountability decisions in the approved ESSA plans from all 50 states and Washington, D.C. These subjective decisions are used to conduct quantitative analyses that have implications for the identification of performance gaps and, therefore, for delineating schools’ role in reproducing racial inequality and for how particular subgroups are constructed as (dis)advantaged. At a time when the collection and availability of educational data in the United States is under threat, a close examination of how quantitative measures are constructed helps us better understand their value and potential as levers for equity given the centrality of racism in U.S. schooling.
Ongoing Debates about ESSA Plan Elements
Since ESSA’s passage in 2015, stakeholders have raised questions about how subgroups of students are defined and the extent to which n-size may impact the rationing of resources and support at school and district levels. Illustrating this ongoing debate, one study of ESSA-related Congressional hearings found that multiple groups expressed the idea that super subgroups would mask some groups’ performance (Wang, 2020). Scholars have expressed similar concern about how subgroup construction will obscure or reveal how schools (dis)advantage particular groups of students. Umansky and Porter (2020), for instance, examined ESSA in the context of making recommendations to states about how to support English learners (ELs). They highlight that ESSA distinguishes between EL students with disabilities, newcomer students, and students who have remained classified as ELs for more than five years: disaggregation that provides the ability to examine and potentially address students’ intersectional identities and the disparate educational experiences among these populations in ways that were previously impossible. Although state ESSA plans allow students to be classified in multiple, non-mutually exclusive categories, these categories do nevertheless reduce students’ complex, multidimensional identities to the categories the state selects (Covarrubias & Vélez, 2013; Mahiri, 2017). When reviewing all state ESSA plans, Chu (2019) found that states generally used the statutorily required subgroups, using “verbatim words from the U.S. DoE template with few or no substantial, state-specific modification for the other groups” (p. 11). Yet, Chu (2019) also notes that there were substantial differences in the additional subgroups states included in their ESSA plans, indicating that states are taking up the flexibility that they were offered. ESSA is widely perceived as transferring some authority from the federal level to the state level, a shift of power the second Trump administration aims to significantly further, making it all the more important to understand the decisions states are making.
Theoretical Framework
Scholars have increasingly applied a Critical Race Theory (Delgado & Stefancic, 2001) perspective to quantitative methods to emphasize how both quantitative measures and analyses are socially constructed and often serve to reproduce racist power structures (Gillborn et al., 2018). QuantCrit scholars generally call for a more conscious and critical approach to the production, use, and interpretation of numbers that, for example, directly implicates racism as the cause of differences in mean-group performance measures (Castillo & Gillborn, 2023). Although QuantCrit is a recently coined term, the framework’s intellectual roots have been traced to early sociology: in particular, Du Bois’s (1899) data visualizations that shed light on how white power structures shaped Black communities (Castillo & Strunk, 2025; Garcia et al., 2018). A century later, sociologist Tukufu Zuberi (2001) highlighted the importance of disconnecting statistical tools from the “White logic” (Zuberi & Bonilla-Silva, 2008) enmeshed in their eugenic roots. As applied to quantitative methods in the field of education, Covarrubias and Vélez (2013) offered Critical Race Quantitative Intersectionality to challenge how the educational outcomes of students of color are often viewed through reductive quantitative measures, calling for a more precise understanding of students’ complex, intersectional experiences along the educational pipeline.
Building on this conceptual foundation, Gillborn et al. (2018) proposed five central tenets of QuantCrit. The first is “the centrality of racism” (Gillborn et al., 2018, p. 169), which demands vigilant attention to how quantitative measures may reproduce the “racial status quo” (Gillborn et al., 2018, p. 170). Second, “numbers are not neutral,” as “quantitative data is often gathered and analyzed in ways that reflect the interests, assumptions, and perceptions of White elites” (Gillborn et al., 2018, p. 170). Third, “categories are neither ‘natural’ nor given” (Gillborn et al., 2018, p. 169), and fourth, “data cannot ‘speak for itself’” but rather is “open to numerous (and conflicting) interpretations” (Gillborn et al., 2018, p. 173). Finally, quantitative methods should be used with a social justice orientation. We focus on two of these central tenets as demanding a deeper analysis of subgroup composition and n-size policymaking: that “categories are neither ‘natural’ nor given,” and that “numbers are not neutral.” Government officials and researchers make decisions about the categories used to collect and analyze data that reflect their own biases and broader social norms; these categories are often the ones that have been used in past practice, which, when adopted without careful consideration, can replicate racist hierarchies and also obscure important variations within groups (Irizarry, 2015). Far from being fixed or objective, as Baker et al. (2022) highlight, “Categories are social phenomena, laden with the contextual, political, and social understandings of the people who create and use them. Consequently, categories play a central role in our lives, where they are often treated as static when, in fact, categorization is quite dynamic” (p. 7). A QuantCrit perspective helps us understand why states’ rationales and decisions about subgroup composition and n-size are important, as it highlights that these decisions are neither inevitable nor solely driven by allegedly objective considerations of statistical soundness. QuantCrit also compels us to recognize that subgroup composition and n-size decisions have implications for which racial and ethnic subgroups are perceived as (dis)advantaged and how resources are distributed among student groups in a racist, inequitable education system.
QuantCrit provides both methodological guidance for using numbers in research and a conceptual lens that can be applied to studies of quantitative measures in policy that employ a range of methods (Gillborn et al., 2018). We use QuantCrit as a lens in our qualitative analysis of the construction of quantitative measures for subgroup accountability, focusing on states’ rationales for subgroup composition and n-size decisions. In this way, our approach is similar to Campbell-Montalvo (2020, 2021), who applies QuantCrit to a comparative ethnography regarding how school-based personnel categorize students by race, ethnicity, and language when they register students for school and form classes within a grade level. Educators brought their own perceptions of students’ racial characteristics into their judgments about how to classify students’ identities, as they tried to shoehorn students into state-mandated subgroups. This process of racial re-formation was specifically prompted by the mandated subgroups identified by the federal government and passed down to states under NCLB. As Campbell-Montalvo (2020) writes, “Racial re-formation occurs as the Florida Department of Education (FDOE) and public K–12 schools in the U.S. Florida Heartland must change racial/ethnicity raw data into the model the FDOE uses, which only allows for six racial and one ethnic groupings” (p. 181). The consequences of this racial re-formation process were inconsistent placement of students into categories based on adults’ implicit racial ideologies and a dramatic undercounting of students’ indigenous languages. Similar to Campbell-Montalvo (2020), we use a QuantCrit lens and qualitative methods to examine decisions about the classification and counting of students, uncovering in part the subjective processes behind the creation of quantitative educational measures. We draw on tools from rhetoric and linguistics, analyzing states’ “discursive legitimation strategies” (Vaara et al., 2006; Van Leeuwen, 2007), or the linguistic strategies organizations use to justify their practices (Suddaby & Greenwood, 2005). Others have used this approach to examine “micro-level textual practices and strategies” that, for example, media organizations utilized to cover company mergers (Vaara et al., 2006, p. 791) or executive pay during the 2008 financial crisis (Joutsenvirta, 2013). Although discursive legitimation strategies have not been previously employed with a QuantCrit lens, we believe that they provide a language-based, methodological tool well-aligned to the tenets of QuantCrit.
Linguist Theo van Leeuwen (2007) proposed four ways in which language is employed for legitimation. Authorization refers to an appeal to tradition, common practice, or the authority of certain individuals, experts, or regulations. Moral evaluation references values and belief systems that define what is good and bad. Rationalization is the attempt to show fitness in relation to desired goals and effects (instrumental rationalization) or to generalized statements about “the way things are” (theoretical rationalization). Finally, mythopoesis utilizes narrative, such as through moral or cautionary tales, that defines legitimate and illegitimate actors, practices, and structures. The varied discursive legitimation strategies that state educational agencies (SEAs) use to justify their decisions around subgroup accountability reveal through a QuantCrit lens how subgroup categories are “neither ‘natural’ nor given” and that numbers such as n-size are not neutral; they demonstrate that quantitative measures are malleable with significant room for SEA discretion and only gain meaning through interpretation (Gillborn et al., 2018).
Methods
The research question that guided our investigation was, “How do SEAs justify their decisions about subgroup composition and n-size in their approved ESSA plans?” Consistent with our theoretical framework, we examined the plans for the decisions that states made, as well as how states employed various discursive legitimation strategies to justify these decisions. Below, we review the changing federal guidance to understand the context in which SEA officials were creating their ESSA plans before describing our data collection and analysis procedures.
Federal Guidance and the State ESSA Plan Approval Process
The ESSA legislation was signed in December 2015, and it was initially set to take effect in the 2017–18 school year (Klein, 2016a). The Obama administration released guidance throughout 2016, including a template for states to use in constructing their ESSA plans, fleshing out the legislative requirements for identifying and supporting schools in need of improvement (Klein, 2016b). However, the change of administration from Obama to Trump and the subsequent appointment of Betsy DeVos as U.S. Secretary of Education following the 2016 election shifted the guidance in accordance with new priorities. In March 2017, ED released an adjusted state template, with a first deadline for state ESSA plans on April 3, 2017 and a second deadline on September 18, 2017. Sixteen states and Washington, D.C. submitted plans to ED by the earlier deadline (Duff & Wohlstetter, 2019), and when ED began releasing public feedback on those plans in June, its feedback was criticized by both the Council of Chief State School Officers (CCSSO) and Senator Lamar Alexander (R-TN), an ESSA policy entrepreneur, for being too prescriptive—thus contradicting ESSA’s intent of returning more decision-making power to the state level (Ujifusa, 2017). In response, ED shifted its feedback process to an initial phone call with SEA officials to discuss points of feedback before releasing public, written comments. As reported by Duff and Wohlstetter (2019), areas identified for improvement in state ESSA plans fell into two categories: “insufficient information” or “violations” of the law’s requirements (p. 300). For example, Florida’s plan was twice considered incomplete because the SEA did not request waivers for required ESSA elements like including EL performance in school accountability and because it included a super subgroup composed of the lowest-performing 25% of students in a school, rather than separately considering the test scores of ELs, students in special education, and minoritized students (Mann, 2018). States sometimes modified their plans in response to ED’s feedback, but they were also often successful in pushing back and maintaining aspects of their original plans (Duff & Wohlstetter, 2019). For instance, Florida’s plan was the last to be approved, and although the state made concessions to ED’s feedback over five rounds of revision, it kept its additional super subgroup identifying the lowest-performing 25% of students in place.
Data Collection and Analysis
We gathered states’ approved ESSA plans and guidance from ED across both the Obama and Trump administrations. We used the ED ESSA templates to determine the relevant portions of each plan to code, as states’ approaches to n-size and subgroup composition were typically located in their response to particular questions from ED asking about those issues. The scope of our analysis is limited, therefore, to a discrete portion of each ESSA plan that specifically addresses these two components of subgroup accountability, and it does not account for how performance gaps between student subgroups may be discussed in other sections of the document. For the few states that elected to use a format that varied from the ED templates, we utilized the table of contents and the “find” function for key terms to locate relevant sections to ensure that we had accurately identified state decisions and legitimation strategies for n-size and subgroup composition.
We uploaded all approved state plans to Dedoose, a program for computer-assisted data analysis. We combined attribute coding with both deductive and inductive coding to qualitatively analyze the documents’ content. We first used attribute coding (Saldaña, 2015, pp. 82–86) to add “descriptors” to each plan for the student subgroups included for accountability purposes and the n-size the state selected. Attaching descriptors to each plan allowed us to identify possible patterns between states’ decisions and legitimation strategies. We also created a matrix (Miles et al., 2014) to record a variety of information about each state’s plan, including whether states used the template from the Obama administration, Trump administration, or an alternate format, as well as the date states submitted their plans, received feedback, resubmitted, and were approved. SEAs could elect to collect data on a distinct set of subgroups and n-sizes for the purposes of reporting, and although we tracked these decisions as well, we focus on SEAs’ choices related to accountability.
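As a concrete (and wholly hypothetical) picture of what such a matrix can look like as a data structure, the sketch below pairs each plan with attribute codes; the field names and values are illustrative, not our actual coding scheme.

```python
# Simplified, hypothetical sketch of a plan-level attribute matrix
# (Miles et al., 2014); fields and values are invented for illustration.
plan_attributes = [
    {"state": "Example State A", "n_size": 20, "template": "revised (Mar. 2017)",
     "additional_subgroups": ["former ELs combined"], "submitted": "2017-04-03"},
    {"state": "Example State B", "n_size": 10, "template": "alternate format",
     "additional_subgroups": ["super subgroup: lowest 25%"], "submitted": "2017-09-18"},
]

# Attribute codes like these let decisions be cross-tabulated against
# the legitimation strategies coded in the plan text.
by_n_size = {}
for plan in plan_attributes:
    by_n_size.setdefault(plan["n_size"], []).append(plan["state"])
print(by_n_size)  # {20: ['Example State A'], 10: ['Example State B']}
```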
Unlike the more concrete decisions about subgroup composition and n-size, our coding scheme for states’ legitimation strategies required a more iterative approach. Each author initially coded half of the ESSA plans, examining states’ rationales for subgroup composition and n-size. At this stage, we used an open coding approach to categorize state justifications, developing our codebook through weekly conversations where we compared states’ discursive practices. An example of an open code we employed was “compromise/balance,” which we defined as when states referred to balancing two competing goals, such as statistical precision and student privacy. We used this code to describe Virginia’s concern for both privacy and the inclusiveness of its accountability system: “Important factors in selecting a minimum n include minimizing the exclusion of student outcomes in the accountability system, while . . . making sure that Personally Identifying Information (PII) for individual students is not disclosed” (p. 10). Other sample open codes included “local demographics to justify subgroup composition” (when states explained their subgroup choices by referring to the demographic composition of youth within the state), “reference to what other states do,” and “past practice” (when states explained their decisions as a continuation of prior subgroup composition and n-size decisions).
Based on the themes generated from open coding, we found strong connections to the literature on discursive legitimation strategies (see Table 1 for definitions and examples of each legitimation strategy). In particular, states frequently relied on rationales such as appealing to “tradition” (Van Leeuwen, 2007) when they referred to “past practice” or to the authority of an “expert,” such as a statistician. Some states used moral evaluation strategies to describe their choices, invoking what was “equitable,” for example, and thus, a legitimate decision. We then re-coded the dataset in light of Van Leeuwen’s (2007) discursive legitimation strategies, and we used these concepts to further develop our codebook and interpret our findings. In the ESSA plans, we found evidence of three of Van Leeuwen’s (2007) legitimation strategies—authorization, rationalization, and moral evaluation. We did not find examples of mythopoesis, a finding that may be attributed to the type of documents analyzed, since policy documents do not often employ the conventions of narrative storytelling. Utilizing the document-level descriptors in Dedoose (attribute codes), we also explored relationships between SEAs’ legitimation strategies and their selected n-size and subgroup composition to identify potential patterns, and we examined the ways in which states described how a legitimation strategy impacted their process of determining subgroup composition and n-size.
Table 1. Legitimation Strategies, Definitions, and Examples
Findings
Below, we describe how states justified their decisions about subgroup composition and n-size, drawing on Van Leeuwen’s (2007) three relevant discursive legitimation strategies: authorization, rationalization, and moral evaluation. Each strategy can take different forms, as outlined in Table 1. Further, these legitimation strategies are not mutually exclusive; many of the examples we discuss integrate more than one of these types of rationales. In addition, multiple states sometimes employed the same rationale in different ways, depending on the textual framing and emphasis. Some states focused on demonstrating the inclusivity of their accountability measure as a rational decision leading to the most effective identification system, while others framed inclusivity as a moral issue to assess whether their decision-making was equitable. For each component of subgroup accountability, we first discuss what the ED templates required and subsequently illustrate the discursive legitimation strategies SEAs employed in their plans.
State Decision-making About Subgroup Composition
The ED templates from both the Obama and Trump administrations called on states to make three decisions about subgroup composition for the purposes of subgroup accountability: (1) the major racial and ethnic subgroups included; (2) any additional subgroups, including super subgroups; and (3) whether or not to include students formerly classified as ELs with students currently classified as ELs. In contrast to the guidance provided for n-size, neither template asked states to provide rationales for their decisions about major subgroups or ELs; the January 2017 guidance asked states to ensure they had a “technically and educationally sound rationale” for any additional subgroups used for accountability. Many states justified their decisions around additional subgroups, drawing on a variety of rationales. Some also provided justifications for their decisions about major subgroup composition and ELs, mostly referencing state demographics. We describe states’ explanations for their subgroup composition decisions below, organized by Van Leeuwen’s (2007) legitimation strategies.
Legitimizing Subgroup Composition with the Authority of Tradition
States that combined multiple racial and ethnic categories—creating what is sometimes known as a “super subgroup” (Ujifusa, 2016)—often relied on tradition to justify their decision-making. Tennessee, for example, began using a super subgroup they referred to as “BHN” (meaning “Black-Hispanic-Native American”) for their 2012 NCLB waiver in addition to individual subgroup classifications. Tennessee justified their use of the “BHN” subgroup by saying that it had been used in the past, allowed greater numbers of students to be counted in their accountability system, and thus, would provide more accurate identification of schools in need of improvement (p. 70). Tennessee called on custom and tradition to authorize its strategy by utilizing a measure with which states and districts were already familiar, at the same time that it employed an instrumental rationale focused on the goal of maximizing the number of students for which schools are held accountable. States also invoked tradition to make decisions about whether to classify current and former ELs together or separately. New Mexico, a state with many ELs, used authorization based on past practice, stating that it would not include students who have exited EL programs and citing the precedent set in “prior accountability models” that “preserves historical continuity and comparability with previous years” (p. 88). In other words, New Mexico, like several other states, draws on the authority its past practices have given to this current decision, prioritizing stability over time in its measures for performance gaps.
Legitimizing Subgroup Composition Through Instrumental Rationalization
In justifying their selection of major, additional, or composite subgroups to include in their accountability systems, SEAs most frequently used instrumental rationales, which can take several forms. SEAs often emphasized how effectively implementing subgroup accountability depends on a practical alignment between the state’s demographic landscape and policymaking, describing these decisions as the “means,” or the process, to arrive at robust accountability measures. Rationales for additional or super subgroups tended to appeal to the effects of a particular choice (“effects-oriented”).
Means-oriented strategies: Major subgroups
Decisions about the inclusion of specific student subgroups were often framed with rationales focused on the means, or the process, through which accurate accountability designations could be made: namely, accountability decisions could only be valid if student subgroups accurately represented the demographics of students within the state. Hawaii included the major subgroups of “Native Hawaiian,” “Filipino,” “White,” “Asian not including Filipino,” “Pacific Islander,” “Hispanic,” and “Black.” Notably, white, Black, and Hispanic students together compose under 25% of Hawaii’s student population. Hawaii references state demographics to justify its subgroup composition, with additional consideration for prior student performance, writing that Native Hawaiian (26%) and Filipino (22.1%) are the two largest subgroups, and that Pacific Islander students, although the fifth largest subgroup proportionally, “struggle the most on the academic indicators” (p. 27). Taking a different approach, Montana’s Office of Public Instruction offered a clear, numeric threshold for what constituted a major subgroup for the purposes of accountability. Montana’s plan explains that “a ‘major subgroup’ means 5 percent or more of students statewide” (p. 15), defending its choice not to include Hispanic students as a subgroup until the 2017–18 school year, when the SEA anticipated the proportion of Hispanic students would meet that threshold. (Montana’s two major subgroups at the time of ESSA plan submission were white and American Indian students.) Although many states did not provide justifications for the selection of their major subgroups, those that did often instrumentally linked their subgroup choices to the demographic composition of students within the state as a means of accurately reflecting group performance.
Effects-oriented strategies: Additional subgroups
To describe their choices about additional subgroups, some states used effects-oriented strategies. Arizona’s three additional subgroups focus on students in accelerated math sequences: those who have completed algebra, geometry, or algebra 2 before high school. The SEA describes this choice as allowing the agency “to better track the exceptional work that our LEAs are doing with advanced learners and to recognize their efforts in this area . . . to discover LEAs who are having great successes with students . . . [so that] the SEA can facilitate peer-to-peer learning networks in the support of student academic achievement” (p. 11). Although this peer-to-peer network was first brought up as an effect of identifying the progress of “advanced learners” (p. 11), the SEA continues, “Because some of our student groups lag far behind others, they will have to grow at a significantly greater rate to close proficiency gaps. Creating a peer-to-peer network will assist LEAs in achieving these rapid growth rates” (p. 11)—rates of change in the achievement of students not in the additional subgroups of “advanced learners.” This rationale focuses on the effects of the SEA’s decisions—viewing the choice of creating an “advanced learner” additional subgroup as having a positive effect on achievement far beyond this group of students, for equity in general. Another example comes from Arkansas, which proposed to include two additional groups (“students participating in Gifted and Talented programs” and “currently classified English learners”) with the rationale that this will “increase transparency for the outcomes for these student groups” (p. 21). The effects referenced here link innovative practices to the overall purpose of subgroup accountability, making sure subgroup outcomes will be tracked and schools will be held accountable for their performance.
Effects-oriented strategies: EL grouping and super subgroups
States also provided instrumental arguments about the effects of their choices to combine multiple student subgroups, often making a case for how the outcome of their decision-making improves the reliability and specificity of the measures produced. For example, Nebraska’s rationale for including former ELs with current ELs was that it “helps to stabilize a subgroup that is less static than other subgroups” (p. 96) and that including former ELs when they are reclassified would allow schools to “better demonstrate progress” (p. 96). Nebraska’s rationale focuses on how including former ELs, according to the state, will have the effect of schools’ progress being more accurately and fairly measured. Minnesota also invoked an instrumental argument based on effects in its decisions about EL reporting. The state decided to report EL students’ scores both separately and combined (i.e., with and without former ELs included), with the rationale that “This will preserve the ability of the public and educators to focus specifically on current English learners when desired while also honoring the desire of many stakeholders to see former English learners included” (p. 6). Here, Minnesota describes how the use of both groups will allow it to achieve the desired effects of responding to stakeholders and identifying the current performance level of students who might be most in need of continued language services.
Other states legitimized the combination of student groups for the effects of maximizing the number of students and student groups for which schools are held accountable and of enhancing accountability for lower-performing student groups. Mississippi formed a super subgroup composed of the “lowest performing 25%” on the state assessment. The state’s ESSA plan described this as a strategy to “catch” all students who are struggling and identify the schools in which they are located, so that regardless of the students’ racial or ethnic classification they would have a higher chance of being included in at least one subgroup. Massachusetts also created a “High Needs” subgroup (consisting of students classified as “economically disadvantaged, students with disabilities, or formerly/current English language learner subgroups”) with a similar rationale, calculating that this would allow 150 more schools “to be held accountable” than if the three subgroups were examined only independently. These rationales emphasized that the effect of these choices was an enhanced ability to identify students from historically marginalized and/or lower-performing student groups at the maximum number of schools possible. Vermont also focused on the effects of its subgroup decisions for the scope of subgroup accountability, responding to the state’s relatively low proportion of students of color by creating two additional super subgroups: “Historically Marginalized Students” and “Historically Privileged Students.” Vermont defines historically marginalized students as essentially any student with one or more of the following characteristics: “ethnic and racial minorities, English learners, students with Free and Reduced Lunch, students with disabilities, and students who are migrant, foster, or homeless” (p. 14). Historically privileged students are those who are not classified with any of those characteristics. Vermont explains that this will “increase transparency around student performance” to help account for “Vermont’s small schools and relatively low levels of diversity [which] often mean that student groups are too small to show data which might point to inequities in experience” (p. 14). Vermont goes on to state that if a school has enough students with any one of the characteristics included in the Historically Marginalized category, subgroup data would be reported by those characteristics. However, the state will also report the scores of the “Historically Marginalized Students” super subgroup to ensure that students with any of the “characteristics of concern” (p. 15) are captured in at least one reportable subgroup. Vermont notes that without this subgroup, it would not be able to report data from many student subgroups in its schools because of state demographics. Here, Vermont’s rationale acknowledges the instrumental practicalities of subgroup composition in a state with relatively little racial or ethnic diversity while focusing on the positive effects of super subgroups, in contrast to civil rights groups’ concerns that super subgroups mask some student subgroups’ performance.
Florida went further than other states in justifying its use of super subgroups, not only arguing for their positive effects, but also critiquing ED’s reliance on assessing student progress through individual subgroups. Florida included a super subgroup of students scoring in the “lowest-performing 25%” (p. 9), writing that students from groups “that are historically low-performing” are overrepresented in the “lowest-performing 25%” (p. 9). Florida used effects-oriented rationales similar to those of other states to establish this super subgroup, which increases the likelihood that students will be counted even if a particular subgroup is underrepresented within a school. However, as a way of rationalizing its choice, which received pushback from ED, Florida also expressed that for states that followed ED’s subgroup guidance, the plan could have unintended consequences similar to the phenomenon of “bubble kids,” in which teachers prioritize the achievement of students close to a proficiency threshold (Booher-Jennings, 2005). Under ED’s plan, according to Florida, when schools do not have enough students in a subgroup for that group to count for accountability metrics, schools are incentivized to focus on subgroups that do count for accountability ratings, even when those groups may not be those most in need. According to Florida’s plan, its strategy will effectively create the proper incentives and forestall unintended consequences. Florida expresses a classic accountability theory of action: that the subgroup composed of the “lowest-performing 25% . . . provid[es] a real incentive in the school grades formula for aligning instructional resources to focus on low performers, and in so doing, rewards schools and LEAs that are successful in reducing achievement gaps” (p. 10). Florida sought to harness the power of accountability incentives to improve achievement without the unintended consequences of prioritizing some groups’ achievement over others. Florida’s legitimation strategy demonstrates a focus on the positive effects of its choices as contrasted with the potential negative effects of the federally recommended subgroup composition.
State Decision-Making About N-Size
Across the fifty states and Washington, D.C., eight SEAs selected an n-size of 30, two an n-size of 25, twenty an n-size of 20, four an n-size of 15, one an n-size of 11, and sixteen an n-size of 10 (see Table 2). Distinct from the guidance about subgroup composition, which only asked states to provide reasoning for additional subgroups, the n-size section in both templates required SEAs to justify their decisions, specifically with regard to how the selected n-size was statistically sound and ensured the privacy of individual students. In other words, too small an n-size may result in student privacy violations, as well as in schools being cited for subgroups that are not large enough to show meaningful, statistically significant differences. Increasing the n-size is thought to build in more privacy protections and improve statistical soundness. However, too large an n-size may undermine transparency and subgroup accountability, as schools with smaller subgroup populations may then not be held responsible for equitable performance across their student bodies. The original template also required SEAs to ensure that their n-size led to the greatest inclusion of students and subgroups and to report on the number of students overall and the number of students in each subgroup that SEAs would not be held accountable for with the chosen n-size. For SEAs with an n-size greater than 30, the original template asked SEAs to provide additional justification for their choice and to specify the number of schools that would not be held accountable for each student subgroup; however, no SEA adopted an n-size this large. Although stakeholder input on the state ESSA plan was deemphasized in the DeVos guidance as a whole (Klein, 2017), the revised template asked SEAs to explain how their n-size was determined and specifically how collaboration with stakeholders (teachers, parents, etc.) was part of the process.
SEAs and N-Size Selection
In this section, we discuss how SEAs justified their n-size selections through the lens of Van Leeuwen’s (2007) discursive legitimation strategies. The templates elicited specific considerations, as described above. Several of those considerations, including how the n-size protects student privacy and should be statistically sound, point to instrumental rationales focused on goals and the means of reaching those goals. Others, like incorporating stakeholder feedback and the trade-offs between n-size and inclusivity of the accountability system, invoked moral arguments. Since these considerations were mentioned in the templates, it is not surprising that we find them used across SEAs with a range of n-sizes. Other strategies that states frequently used, though they were not prompted by the template, drew on authorization—referencing past practice (tradition), invoking statistical experts (expertise), and/or demonstrating similarities with other SEAs (conformity). SEAs also contextualized the decision within local demographics, a more instrumental rationale.
Legitimizing N-Size Through Authorization
SEAs frequently referenced past practice, the actions of peer SEAs, and statistical expertise to legitimate their n-size decisions. In this way, SEAs sought to justify quantitative measures in their policies with the authority of routine and the avoidance of deviations from status quo practices.
The authority of tradition: Past practice
In legitimizing their n-size, SEAs commonly referred to past practice or n-sizes used previously (e.g., under NCLB). SEAs suggested that because their former practices were perceived as legitimate for a sustained time period, their current practices should be given the same authority. New Hampshire described how it has been using an n-size of 11 since NCLB, saying, “We have been operating effectively with a minimum-n of 11 for over 10 years, and we do not believe there is any reason to change that well-established practice” (pp. 21–22). This appeal to past practice helps explain New Hampshire’s unique n-size of 11. Similarly, several states referred in their ESSA plans to the careful reconsideration of n-size they conducted for the NCLB flexibility waivers offered under the Obama administration starting in 2011. States such as Minnesota, Maine, Louisiana, and Connecticut indicated that their n-sizes were lowered at the time of applying for and receiving the NCLB waiver and that they are maintaining that n-size, which they argue has already been shown to create a more inclusive accountability system than the one under NCLB.
The authority of tradition was integrated with instrumental rationalizations as SEAs also emphasized that maintaining the same n-size allowed them to use an n-size that had been tested and proven to uphold statistical soundness and student privacy. Virginia explained how keeping the same n-size would help ensure reliability in its accountability system: “Virginia will continue to use a minimum n of 30 students for accountability purposes. For several years, this number has been used to identify low performing schools without inappropriately identifying successful schools or permitting unsuccessful schools to avoid accountability” (p. 10). Minnesota similarly explained that maintaining the same n-size with the adoption of ESSA would help ensure stability in the accountability system and avoid drastic swings in school-level performance. Van Leeuwen (2007) explains that drawing on the authority of tradition and custom has generally declined and may be considered a weak legitimation strategy because it does not adequately explain why a practice is maintained. By combining the authority of tradition with instrumental rationalization that provides concrete reasons for keeping the status quo, SEAs bolstered their n-size justifications.
The authority of conformity: Reference to practices of other SEAs
Some SEAs referenced other states’ decisions to justify their n-size selection. Usually this rationale was not used independently, but rather as an additional argument to further the legitimacy of the n-size. In the case of an ambiguous decision that gave SEAs flexibility, the practice of other SEAs was used to legitimate one’s own choice in a process similar to mimetic isomorphism (DiMaggio & Powell, 1983). For instance, Utah referred to National Center for Education Statistics (NCES) data from 2010 to point out that it selected the most common n-size (10), demonstrating a focus on high-frequency practices in the institutional field, which is common when the authority of conformity is invoked (Van Leeuwen, 2007). South Dakota explained that its ESSA working group, which was composed of several stakeholder groups, spent time considering how other states had determined their n-sizes. Michigan also referred to the practice of other states, but surprisingly did this to justify its selection of an n-size of 30, which was used by only eight SEAs in total: “[Michigan] is not alone in choosing a N-size of 30. It appears that many other state’s accountability systems have come to the same conclusion” (p. 16). Even though Michigan adopted a low-frequency practice, it sought to demonstrate that it was not alone in choosing a large n-size to help justify a decision that could be seen by ED or other organizations as undermining subgroup accountability pressure.
Expert authority: Invoking statistical expertise
Nearly half of SEAs invoked the authority of statistical experts or expertise in legitimizing their n-sizes to demonstrate both statistical soundness and student privacy. Eighteen SEAs referred to state technical assistance centers or task forces established or contracted to consult with the SEAs. Many states established committees composed entirely or partially of measurement experts to advise on the n-size selection process. Other SEAs referred explicitly to guidance or reports released by ED or by NCES at the Institute of Education Sciences (IES), appealing to federal authority to gain federal approval of their plans. Four SEAs described drawing upon guidance from ED’s Privacy Technical Assistance Center. The revised template recommended that SEAs consult the IES January 2017 report, “Best Practices for Determining Subgroup Size in Accountability Systems While Protecting Personally Identifiable Student Information,” in ensuring the privacy of individual students. Given that the revised template explained that states “should consult” this document, it is notable that only seven SEAs referred explicitly to this guidance. Three SEAs reported consulting the 2011 NCES Statewide Longitudinal Data Systems guidance, “Statistical Methods for Protecting Personally Identifiable Information in Aggregate Reporting.” Overall, SEAs were more likely to refer to state-based rather than federal sources, with about a third invoking state expertise and a quarter drawing on federal expertise, with some SEAs referencing both.
Invoking the authority of the field of statistics in general, rather than the particular expertise of individuals or organizations, three SEAs employed a form of theoretical rationalization by explaining that their n-size followed basic statistical principles, which, for two SEAs (Missouri, Michigan), justified an n-size of 30, the largest n-size selected and one chosen by only eight SEAs overall. Michigan reported, “Michigan’s minimum n-size of 30 meets widely accepted and studied statistical practices for ensuring reliability . . . investigation of research and scholarly papers that indicated the number thirty was large enough to yield statistically reliable results” (p. 16). As the example of Michigan illustrates, we found that references to statistical experts or expertise were generally more common for states with larger n-sizes. SEAs with an n-size of 30 invoked statistical experts or expertise at a rate of 75% in their ESSA plans, while 35% of SEAs with an n-size of 20 and 44% of SEAs with an n-size of 10 did so. Thirty was cast by several SEAs, such as Michigan, as the most statistically sound choice, with the reliance on statistical experts and expertise helping justify the selection of the maximum n-size, a choice that could potentially undermine the power of subgroup accountability.
Legitimizing N-Size Through Instrumental Rationalization
SEAs often used instrumental rationalization to demonstrate the purpose and utility of their n-size decisions. In particular, we found that SEAs emphasized their intention to protect student privacy and their use of statistically sound and demographically relevant approaches to ensure reliable and valid accountability measures.
Goals-Oriented Instrumentality: Protecting Student Privacy
Given the prompts in the ED templates, we expected to find SEAs referring to how their selected n-sizes were chosen with the intention of protecting the privacy of individual students. Indeed, in our analysis we found that nearly all SEAs (50 of 51) justified their n-size decision by reference to how it would protect student privacy, constructing themselves as goals-oriented actors (Van Leeuwen, 2007). Massachusetts, which selected an n-size of 20, noted its “long history of reporting vast amounts of data to the general public while at the same time protecting the identity and privacy of its students” (p. 51). Like several other states, Massachusetts expressed that the need for privacy is balanced with the need for transparency, a second, potentially competing goal. Overall, SEAs with an n-size of 10 discussed transparency at more than double the rate of SEAs with larger n-sizes. SEAs with smaller n-sizes were able to highlight the transparency provided by their decision, while SEAs with larger n-sizes had stronger claims to privacy and statistical soundness.
Means-Oriented Instrumentality
Statistically Sound
Similarly reflecting the templates’ guidelines, most SEAs made explicit reference to the statistical soundness of their selected n-size as a means of establishing a valid and reliable accountability system that would fairly and consistently assess each school’s performance. Nebraska explained that it reviewed its historical performance data and determined that an n-size of 10, even in comparison to a more cautious n-size of 25, “showed fairly stable results from year to year” (p. 114) and thereby strengthened the measure’s reliability. Rhode Island, which selected an n-size of 20, explained that “while a lower n-size would include more students, it would also sacrifice year-to-year reliability” (p. 15). The state received requests to lower its n-size to 5, but reported that “to ensure the year-to-year reliability and stability of accountability determinations, Rhode Island will maintain a minimum n of 20” (p. 15). Rhode Island’s n-size justification prioritizes statistical soundness over other considerations, such as stakeholder input and the maximum inclusion of students for which schools would be held accountable. In this case, statistical soundness is framed as a primary way to attain a legitimate measure for subgroup accountability.
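The reliability argument can be illustrated with a small simulation: a subgroup’s observed proficiency rate fluctuates more from year to year when the group is small. The sketch below assumes an invented “true” rate of 40% and compares n-sizes of 10 and 30; it is a didactic illustration, not any SEA’s actual analysis.

```python
import random

# Illustrative simulation of why larger n-sizes yield steadier subgroup
# rates; the underlying "true" proficiency rate (0.40) is invented.
random.seed(0)
TRUE_RATE = 0.40

def yearly_rates(n_size, years=10):
    """Simulated observed proficiency rates for a subgroup of n_size students."""
    return [sum(random.random() < TRUE_RATE for _ in range(n_size)) / n_size
            for _ in range(years)]

for n in (10, 30):
    rates = yearly_rates(n)
    spread = max(rates) - min(rates)
    print(f"n-size {n}: spread across years = {spread:.2f}")
# Small groups typically swing widely year to year, while larger groups
# are steadier: the reliability claim behind larger n-sizes.
```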
Appropriate for School Demographics
A fifth of SEAs noted how the demographics of their schools critically influenced their n-size selection. States with many small and rural schools, such as Oklahoma, Nebraska, South Dakota, Montana, Vermont, Alaska, and Maine, noted that larger n-sizes would exclude a large portion of their schools from subgroup accountability. For some SEAs, such as Nebraska, South Dakota, and Maine, this led to the adoption of an n-size of 10, the lowest n-size adopted by SEAs. Indeed, we found that school demographics were more likely to be referenced by SEAs that chose smaller n-sizes. Half of SEAs with an n-size of 10 referenced local demographics to justify their n-size selection, while only a few SEAs with an n-size greater than 10 did so, including no SEAs with an n-size of 30.
Legitimizing N-Size Through Moral Evaluation
SEAs also framed their n-size decisions as normatively good by showing how they were inclusive and incorporated stakeholder participation—criteria requested in the ESSA templates.
Moral Evaluation: Demonstrating Inclusivity
Many states described how they tried to find the most inclusive n-size that would enhance accountability for the performance of students and student subgroups and thus would be a stronger tool to promote equity through subgroup accountability. We found that almost three-quarters of SEAs discussed the trade-offs between n-size and the number of schools and students that would be included or excluded, an analysis requested in the original template. Some SEAs simply mentioned that their selected n-size maximized the number of students, schools, and/or student subgroups while maintaining statistical soundness and student privacy. Others engaged in a more extensive discussion of various n-sizes and their corresponding degrees of inclusion and exclusion. SEAs frequently referred to finding a “balance” or “compromise” between inclusivity and the year-to-year reliability of the accountability indicators. North Carolina, which ultimately selected an n-size of 30 (the largest n-size selected among SEAs), included extensive tables in its plan demonstrating how n-sizes of 10, 15, 20, 25, 30, 35, and 40 would impact the number and percentage of schools, students, and students from each subgroup included in its accountability system, using both grades 3–8 and grade 10 as case studies (p. 23). North Carolina used these tables to show that an n-size of 30 would mean that “for most subgroups the schools included contain the majority of the targeted student population” (p. 20). In this way, North Carolina favored subgroup accountability at the state level, but minimized its force at the school level.
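A trade-off analysis of the kind North Carolina tabulated can be sketched as follows; the per-school subgroup counts are invented for illustration, whereas the state’s actual tables drew on real enrollment data.

```python
# Hypothetical sketch of an n-size/inclusion trade-off table; the
# subgroup counts per school are invented for illustration.
schools = [
    {"Black": 45, "Hispanic": 22, "SWD": 12},
    {"Black": 8,  "Hispanic": 35, "SWD": 9},
    {"Black": 15, "Hispanic": 6,  "SWD": 31},
]

students_total = sum(sum(s.values()) for s in schools)
for n_size in (10, 15, 20, 25, 30):
    included = sum(c for s in schools for c in s.values() if c >= n_size)
    pct = 100 * included / students_total
    print(f"n-size {n_size}: {pct:.0f}% of subgroup students counted")
# Larger n-sizes exclude more students from subgroup calculations:
# the inclusivity cost states weighed against reliability and privacy.
```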
A quarter of SEAs explained that they sought to maximize the number of students and student subgroups that schools were held accountable for by aggregating students over time, grades, or subgroups in order to reach the designated n-size; a sketch of this pooling logic follows this paragraph. To account for its significant number of rural and small schools, Oklahoma includes the option of aggregating student numbers over the previous three years in order to reach the n-size for accountability. A few SEAs reported that they would aggregate students across grades to reach the chosen n-size and maximize the inclusion of student subgroups. Indiana, aggregating grade-level data across grades 3–8 and grades 9–12, explained that “aggregating grade level data provides for more schools to achieve the required minimum number of students determined necessary to be included in the accountability system” (p. 48). These approaches—aggregating over time or grades—allow SEAs with large rural populations and wide variations in school size to ensure that schools with small populations of students from targeted subgroups would still be subject to subgroup accountability, without selecting an n-size that is too small to ensure statistical soundness.
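A minimal sketch of the pooling logic, assuming hypothetical yearly counts and an n-size of 20 (Oklahoma’s actual rules include further conditions):

```python
# Hypothetical sketch of aggregating a small subgroup across three years
# so that it can reach the accountability n-size; counts are invented.
N_SIZE = 20
yearly_counts = {"2016-17": 7, "2017-18": 8, "2018-19": 9}

single_year_included = any(c >= N_SIZE for c in yearly_counts.values())
pooled = sum(yearly_counts.values())

print(single_year_included)      # False: no single year meets n-size 20
print(pooled, pooled >= N_SIZE)  # 24 True: pooled across years, the group counts
```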
Even though a discussion of inclusivity was requested in the original ESSA template, not all SEAs referenced this issue in their n-size rationales. SEAs with an n-size between 10 and 30 were more likely than SEAs with n-sizes at the extremes of 10 or 30 to discuss the trade-offs between n-size and the degree to which schools and students overall, and those in particular student subgroups, would be included in or excluded from the accountability system. All SEAs with an n-size of 15 and three-quarters of SEAs with an n-size of 20 considered these trade-offs, while just under two-thirds of SEAs with an n-size of 10 or 30 did so. On the other hand, SEAs with n-sizes of 10 were most likely to explicitly discuss the extent to which their n-size was inclusive or exclusive, highlighting the relationship between a small n-size, increased inclusivity, and a strengthened subgroup accountability policy.
Moral Evaluation: Stakeholders Consulted
Also among the most frequent moral legitimation strategies was reference to consulting with stakeholders in determining n-size. Administrators, teachers, curriculum designers, specialists in or advocates for particular subgroups (e.g., special education, English learners), unions, community and nonprofit organizations, business leaders, families, and students were among the groups consulted by SEAs (although not all SEAs reported consulting with all of these groups). Some SEAs simply stated that they incorporated stakeholder feedback, while other SEAs (e.g., Vermont) provided detailed accounts of the methods they used to elicit community input, including focus groups, surveys, listening tours, conferences, and advisory groups. Descriptions of stakeholder consultation, especially those recounting disagreement among the public, depict SEAs as appealing to moral values like democratic deliberation and transparency in an inclusive process for determining n-size. For some SEAs, stakeholder input (or a compromise among stakeholders’ positions) was framed as the most influential factor in selecting an n-size. South Carolina discussed the divergence of opinions about n-size among its stakeholders, demonstrating how civil rights and social justice organizations tended to advocate for lower n-sizes to maximize subgroup accountability, while administrators advocated for higher n-sizes that would maximize statistical reliability and validity but also reduce the chances of schools being cited:

South Carolina previously used subgroup n-counts of 40 (1999–05) and 30 (2005–14); however, based on stakeholder feedback from the Urban League, Hispanic Alliance, and other civil rights groups, the state will use an n-size of 20 for the ESSA reporting and accountability. These organizations maintained that a smaller n-size would allow more schools to be included in the full reporting of subgroup performance . . . Additionally, district superintendent and instructional leader roundtable groups advocated for subgroup n-sizes of 40 or for a percentage model whereby a subgroup would be reported if it met a specific percentage threshold of the full population. These requests were grounded in a desire to increase validity and reliability and reduce deceptive or misleading interpretations that arise from small sample sizes. The SCDE considered all of these recommendations and selected a compromise of reporting and setting performance targets for subgroups with n-sizes of 20. South Carolina has seen tremendous achievement gaps for specific student groups including economically disadvantaged, students with disabilities, and African American students. (pp. 8–9)
South Carolina explained their final decision as a compromise between the different interests in the community, although they ultimately favored the smaller n-size recommended by civil rights groups, one more sensitive to mean-group differences in school-level performance.
Discussion
In the last few decades of federal education policy, subgroup accountability has been a primary, though contested, lever for identifying and reducing school-level differences in mean-group student performance. In this paper, we apply a QuantCrit lens to the discursive tools states use to rationalize their subgroup accountability policy under ESSA, with particular attention to two central decisions: subgroup composition and n-size. Van Leeuwen’s (2007) discursive legitimation strategies draw attention to the range of considerations behind those two seemingly objective decisions, which are, in fact, quite subjective and carry major implications for equity. As Gillborn et al. (2018) note, “[n]umbers are no more obvious, neutral, and factual than any other form of data . . . [and] at every stage there is the possibility for decisions to be taken that obscure or misrepresent issues that could be vital to those concerned with social justice” (p. 163). The range of strategies SEAs used in their ESSA plans reveals, in part, how quantitative policy measures are constructed through a complex and subjective policy formation process.
Authorization, rationalization, and moral evaluation are three mechanisms that legitimize the social construction of data (Castillo & Gillborn, 2023), calling into question the objective veneer of quantitative policy measures. Authorization strategies commonly included appeals to tradition in referencing prior decisions and practices; the authority of conformity in referencing other states’ decisions; or expert authority, as when states consulted a statistical expert. SEAs also employed rationalization strategies, primarily instrumental rationalizations that focused on their desired goals (e.g., protecting student privacy) or emphasized the means through which they achieved the overall goal of accurately measuring performance gaps within a state (e.g., adding subgroups to better match changing state demographics). Finally, SEAs used moral evaluation strategies that appealed to values such as inclusivity and democratic deliberation. For decisions about subgroup composition specifically, SEAs relied heavily on appeals to past practice and student demographics and emphasized their efforts to achieve inclusive and stable accountability measures that effectively identified performance gaps. For n-size decisions, SEAs referenced past practice, statistical expertise, the practices of other SEAs, stakeholder consultation, inclusivity, and their intention to achieve an accountability measure that provides statistical soundness, student privacy, and responsiveness to school demographics. SEAs’ rationales for n-size and subgroup composition in their approved ESSA plans reveal discursive strategies that can be powerful in educational policymaking, and the array of strategies employed indicates how discursive legitimation strategies offer flexible devices for states and other stakeholders to exercise their agency in the policymaking process.
Through a QuantCrit lens, the range of discursive legitimation strategies SEAs employed in their subgroup accountability policymaking complicates how we view subgroup composition and n-size. Policy measures that are ultimately perceived as “statistically sound,” a consideration ED explicitly encouraged, are constructed from varied factors, including but not limited to considerations of validity and reliability. States claimed that they were making valid and reliable decisions by consulting a statistician or ED guidance or by creating tables that demonstrated the trade-offs of different n-sizes. However, we also saw that subgroup composition and n-size decisions were influenced, for example, by past practice, the practices of other SEAs, and advocacy from public stakeholders, none of which necessarily prioritizes reliability and validity. For instance, South Carolina attributed its decision to decrease its n-size to how it negotiated divergent pressures from different advocacy groups, ultimately prioritizing a smaller n-size, which might pick up performance gaps even when a subgroup has few students, over concerns for reliability and validity. A QuantCrit perspective, which reminds us that numbers are not neutral (Gillborn et al., 2018), not only demands attention to the range of considerations behind claims of statistical soundness but also requires us to ask what additional factors not included in SEAs’ rationales might lie behind decisions that determine quantitative policy measures. In addition to the rationales we discuss in this paper, SEAs’ unwritten considerations may include states’ prior experiences with subgroup accountability and federal oversight, perceptions of changing state demographics, interactions with local districts, and, importantly, dominant societal norms and racism.
QuantCrit Implications for Using Discursive Legitimation Strategies
Although we primarily apply QuantCrit as a theoretical lens that demands the unpacking of subgroup accountability decision-making, here we also look briefly to methodological guidance from QuantCrit to evaluate the specific discursive legitimation strategies SEAs used. Some strategies, such as moral evaluation and instrumental rationales focused on goals or means, aligned more often with QuantCrit scholars’ call for intentional and transparent decision-making oriented toward social justice that incorporates the voices of implicated community members in the production of quantitative measures (Castillo & Gillborn, 2023). Regardless of the specific strategy used, adding transparency and detail to subgroup accountability decisions, seeking more stakeholder input into those decisions, adding subgroups to be more sensitive to state and local demographics, and linking the effects of decisions to equity while using asset-based framing when describing students are all ways that states can write about their choices in better accordance with QuantCrit principles.
These elements, however, were often absent from the discursive legitimation strategies we found. For example, appeals to the authority of tradition or conformity often came with no additional explanation of how past practice or the practices of others would enhance equity, lacking intentionality and a clear orientation toward social justice. An emphasis on statistical expertise absent the input of community members may reproduce racist and deficit-oriented quantitative measures. For instance, we found that the authority of experts was most frequently used to legitimize an n-size of 30, the largest n-size selected by SEAs, raising the question of whether statistical expertise tends to be invoked in ways that weaken subgroup accountability pressure as a tool for identifying educational inequities. Notably, authorization, which includes appeals to tradition, conformity, and statistical expertise, was the most frequently used discursive legitimation strategy across SEAs’ ESSA plans, demonstrating how far current quantitative policymaking remains from the vision set forth by QuantCrit. Future language-based research on the formation of quantitative policy measures could continue to systematically examine specific rationales for decision-making through a QuantCrit lens, and ED or community organizations could use QuantCrit tenets and methodological guidance to evaluate state policy.
The Importance of Transparency in Decision-Making
A key implication of our study is for SEAs to provide detailed accounts of their decision-making about subgroup composition and n-size, regardless of the legitimation strategy used, in line with QuantCrit recommendations for making administrative panel data collection and analysis decisions more transparent (Castillo & Gillborn, 2023; Castillo & Strunk, 2025; Viano & Baker, 2020). In their ESSA plans, SEAs provided varying levels of detail in their rationales, with implications for the transparency of their decisions. Often, when SEAs invoked expertise, past practice, or other states’ practices as justifications for their own decisions, they did not describe a clear process for these high-consequence decisions. SEAs could have described, for example, their similarities to other states’ contexts or the ways in which their prior practices had worked well in identifying schools in need of improvement; such detail would have made these decisions more transparent. Including transparent, detailed accounts of decision-making forces SEAs to think intentionally about the creation of quantitative measures and their consequences for social justice, rather than treat quantitative policymaking for subgroup accountability as an objective exercise in compliance aimed at deriving the most statistically sound measure. Some promising examples came from states that used moral evaluation in describing an extensive process of gaining stakeholder input or that clearly discussed the advantages and disadvantages of different decisions (often associated with instrumental rationales focused on goals or means). Regardless of the legitimation strategy used, a detailed description of the decision-making process acknowledges that quantitative measures are socially constructed, informed by a range of considerations and interests. Transparency offers a more accurate picture of the drawbacks and limitations that come with any decision and of the ways in which quantitative measures for policymaking are shaped by considerations beyond statistical soundness. It also gives stakeholders more access to critically engage with the policy process, and thereby to perceive it with greater legitimacy and/or advocate for reform.
One decision that requires particular intentionality and transparency is how SEAs decide to aggregate or disaggregate students for accountability purposes. QuantCrit calls for the use of fine-grained categories that capture the distinct, intersectional experiences of various student groups so as not to erase or misrepresent how students are differently impacted by racism in schools: “grouping, rather than disaggregating, minoritized students can disguise areas of greater inequity” (Castillo & Gillborn, 2023, p. 9). This position suggests that super subgroups are problematic because their aggregation of students may mask divergences in student experiences and outcomes. In line with this perspective, much of the controversy around super subgroups has focused on how they obscure performance gaps, hiding disparities that would only be visible with additional disaggregation (Wang, 2020). However, we found that some SEA officials discursively made the case that their super subgroups attended to state demographics in ways that helped them identify schools in need of focused attention that would otherwise not be held accountable for inequitable academic outcomes. Vermont, for example, with a relative lack of racial diversity and smaller schools, noted that using standard subgroups would “mean that student groups are too small to show data which might point to inequities in experience” (p. 14). Instead, Vermont proposed a “Historically Marginalized” subgroup incorporating multiple historically marginalized dimensions of students’ identities to enhance subgroup accountability pressure. In this case, the QuantCrit tenet to use quantitative methods to promote social justice comes into tension with its practical guidance to disaggregate to more precisely reflect disparate student experiences and outcomes. Accordingly, when electing to use super subgroups, it becomes especially important to provide specific and transparent justifications that recognize this tension and explore the advantages and disadvantages of aggregating student groups for identifying educational inequities.
SEAs that make the case that super subgroups actually enhance the power of subgroup accountability to identify educational inequities may balance their decision by disaggregating groups or selecting a smaller n-size for reporting purposes. Because ESSA allows states to select different subgroups and n-sizes for accountability and reporting, disaggregating super subgroups for reporting could provide more precise insight into which students are being underserved in schools, although these findings would not carry the force of accountability pressure on schools. As we analyzed the ESSA plans, we noted that some SEAs selected a smaller n-size for reporting than for accountability; however, these choices were primarily framed as an effort to balance public transparency with reliability that would minimize volatility in school citations from year to year. Citing this balance between transparency and reliability, both Michigan and North Carolina adopted an n-size of 10 for reporting and an n-size of 30 for accountability purposes. Here we see that a concern with consistency over time for school staff is associated with the adoption of the highest n-size across states, loosening accountability pressure on schools. With a QuantCrit approach, SEAs could similarly make distinct decisions for reporting and accountability when adopting super subgroups, although they should make clear that doing so balances concerns for social justice (by increasing subgroup accountability pressure on schools) with the precise identification of educational inequities, rather than prioritizing the interests of school administrators and educators. Castillo and Gillborn (2023) recommend that those who want to take a QuantCrit approach to quantitative data “advocate for more detailed data collection that reflects the community of interest. This way, when it is time for analyses, the researcher can make a decision as to whether the sample is large enough for disaggregation, and thus not mask inequities” (n.p.). Super subgroups may serve as an initial screen to identify schools where closer attention from school and potentially state officials to the reported performance of smaller student groups is warranted. Having the data in both forms, disaggregated for reporting and aggregated for accountability, allows SEAs and others interpreting the data to aggregate or disaggregate as appropriate.
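As one way of picturing the dual-threshold arrangement just described, the sketch below applies a hypothetical rule, not any state’s actual formula: subgroups are reported at a low n-size, trigger accountability on their own at a higher one, and a pooled super subgroup can trigger accountability when no single group is large enough. A real implementation would use unduplicated student counts, since one student may belong to several subgroups; the naive sum here is for illustration only.

```python
# Hedged sketch of a dual-threshold design: report at a low n-size, hold
# accountable at a higher one, with a pooled super subgroup as a backstop.
# Thresholds echo the Michigan/North Carolina example (10/30); the school's
# counts and the naive pooling rule are hypothetical.

def subgroup_status(counts_by_group, report_n=10, accountability_n=30):
    """counts_by_group: subgroup name -> student count at one school.
    Returns the groups publicly reported, the groups triggering
    accountability on their own, and whether a pooled super subgroup would."""
    reported = [g for g, c in counts_by_group.items() if c >= report_n]
    accountable = [g for g, c in counts_by_group.items() if c >= accountability_n]
    # Naive pooling; unduplicated counts would be needed in practice.
    pooled_accountable = sum(counts_by_group.values()) >= accountability_n
    return reported, accountable, pooled_accountable

school = {"Black": 12, "Hispanic": 11, "Students with disabilities": 9}
print(subgroup_status(school))
# (['Black', 'Hispanic'], [], True): no single group reaches 30, but the pooled
# super subgroup does, so the school would still be flagged for attention.
```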
Participation in Decision-Making
Whose interests are favored in subgroup accountability policy is closely related to participation in the decision-making process. QuantCrit scholars strongly encourage the involvement of the communities described or impacted by quantitative measures (Castillo & Gillborn, 2023). Decision-making should incorporate voices of color so that the development, use, and interpretation of quantitative measures are informed by the “experiential knowledge of marginalized groups” (Gillborn et al., 2018, p. 158). While some SEAs explained in detail the public input they received from different stakeholders and how it was weighted in their ultimate decision (e.g., Vermont), others gave only cursory mention to public input and instead emphasized their reliance on statistical experts for their n-size and subgroup composition decisions. For subgroup composition, participation helps ensure that “categorization is informed and resonates with the communities of interest” (Castillo & Gillborn, 2023, p. 8). SEAs that emphasized how their decision-making aligned with state demographics demonstrated some consideration of the relevance of their categories, although the direct participation of community members would better ensure their resonance. SEAs can include community stakeholders of color on committees that determine subgroup composition and n-size, as well as institute a robust and accessible process for public input into proposed decisions. A procedure for feedback helps ensure that the categories used resonate within the community and therefore are more likely to align with distinct student experiences within schools. Community and civil rights organizations may use QuantCrit to publicly assess states on their structures for participation in the policymaking process. Moreover, they may increase their influence on subgroup accountability decision-making by leveraging expert authority, framing themselves as holders of particular expertise on subgroup composition on which SEAs must rely.
Conclusion
Subgroup accountability policy is one of few federal levers that aims to promote racial equity in U.S. schools. As a quantitative policy measure, it has critical implications for which student groups are constructed as (dis)advantaged and for which schools are perceived as inequitable, as governments

do not use numbers merely to describe the world, they increasingly use statistics as an essential part of the technology by which they seek to re/shape educational systems. In this way, numbers play a key role in how inequality is shaped, legitimized, and protected . . . Numbers are increasingly used to justify policy priorities and to label teachers, schools, districts, and even entire countries, as educational successes and failures. (Gillborn et al., 2018, p. 161)
Given the ongoing prominence of accountability in K–12 education policy, the creation and consequences of quantitative measures related to subgroups call for applying a QuantCrit lens to qualitative data such as state ESSA plans. Our findings underscore the need for researchers and policymakers who use quantitative data from subgroup accountability policy to take into account the decision-making processes behind states’ subgroup composition and n-size selections. A deeper understanding of who was involved and which priorities were given the most consideration contextualizes the data and indicates the limitations of its use.
Through a QuantCrit lens and language-based methodologies that reveal the multiple and subjective considerations behind decisions regarding subgroup composition and n-size, we call for greater transparency and more robust participation in state policymaking. We do not reject subgroup accountability outright, and we see ourselves as aligned with Fong and Irizarry (2025), who note that “a wholesale rejection of calculating mean differences would be misguided. Mean differences between racial groups exposing disparities in educational opportunities as a result of structural racism can be a powerful way to promote anti-racist evidence” (pp. 5–6). As states continue to refine their data collection and analysis decisions in service of accountability requirements and look toward another potential ESEA reauthorization in the coming years, this paper calls for rationales for identifying schools in need of improvement and student subgroups disadvantaged in schools that are not only “statistically sound” but, more importantly, aligned with the tenets and methods of QuantCrit, the better to use policy for understanding and addressing how racism operates in schools. Subgroup composition and n-size decision-making reveals the sociopolitical context of educational policymaking under an accountability regime overwhelmingly concerned with statistical validity and reliability, yet operating alongside a range of additional considerations. As states experience increased autonomy and decreased federal oversight under the second Trump administration, it is vital to examine the ways in which states negotiate these considerations in their policymaking.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Note: This manuscript was accepted under the editorial team of Kara S. Finnigan, Editor in Chief.
Authors
RACHEL GARVER is an associate professor of educational leadership at Montclair State University, 1 Normal Avenue, Montclair, NJ, 07043, USA;
EMILY M. HODGE is an assistant professor of educational leadership at Virginia Commonwealth University, 1015 West Main St., Richmond, VA 23284, USA;
