Abstract
Validity studies of college admissions tests have found that, on average, students who are Black or Hispanic earn lower freshman grade-point averages (FGPAs) than predicted by these test scores. This differential prediction is used as a measure of bias. These studies, however, conflate student and school characteristics. The differential prediction affecting minoritized groups may arise in part because they attended high schools in which college enrollees, regardless of race, perform worse than predicted. Using data on students who graduated from New York City public high schools in 2011 and enrolled in the City University of New York, we examined this possibility using both college admissions test scores and high school test scores. There was no differential prediction based on race/ethnicity among students within high schools once school characteristics were accounted for. Instead, overprediction of FGPA was associated with the school proportion of enrolled Black and Hispanic students. Overprediction was larger in models with high school test scores.
Introduction
Several decades of validity studies of college admissions tests have shown that test scores over- or underpredict the average freshman year grade point average (FGPA) for some student subgroups defined by race/ethnicity. We refer to both over- and underprediction as differential prediction. These studies have typically found that when college admissions test scores are used alone or in combination with high school grade point average (HSGPA) to predict FGPA, Black and Hispanic students’ FGPAs are moderately overpredicted—that is, their average FGPAs are lower than predicted—and Asian students’ FGPAs are slightly underpredicted (Bridgeman et al., 2000; Mattern et al., 2008; Young, 2001).
Differential prediction of FGPA is of concern for two reasons. First, underprediction may be a form of bias, distorting decisions about admission. Second, student subgroups whose performance is overpredicted may underperform relative to expectations in college, which could jeopardize college completion.
These validity studies, however, have used student-level ordinary least squares (OLS) regression models, and therefore, what appears to be differential prediction associated with individual characteristics may, in whole or in part, reflect characteristics of the high schools students attend. American high schools are highly segregated by race and income, and they vary along many dimensions of quality that may contribute to differential prediction (Owens et al., 2016; Reardon et al., 2000). Schools serving predominantly Black, Hispanic, and/or low-income students rank lower on numerous measures, including teacher quality, access to advanced coursework, and support for applying to college (Clotfelter et al., 2005; Goldhaber et al., 2015; Klopfenstein, 2004; Lankford et al., 2002; Roderick et al., 2008). Differences in school quality can impact skills and knowledge measured by tests, but they may also affect other knowledge, skills, and dispositions that are not reflected in test scores but that may influence college performance. For example, lack of access to challenging coursework in high school may leave students underprepared for the coursework and assessment demands of college. Lack of guidance from counselors and teachers in the college application process may make it difficult for students to find a college that is a good match academically, socially, and financially, and a poor match may depress performance in college. Finally, there may be other factors, such as the college aspirations of students’ peers or school resources, such as technology, workspace, or library resources, that vary among high schools and that influence students’ college performance.
An accurate understanding of the correlates of students’ scores is important for their appropriate use and has implications for policy and practice. Critics of these exams have argued that they are biased against racially minoritized students, and over the last two decades, many colleges and universities have made admissions tests optional or stopped requiring them altogether. This trend accelerated during the COVID-19 pandemic. Nonetheless, many higher education institutions still require them and use them for scholarship and course placement decisions in addition to admissions decisions. Furthermore, since the COVID-19 pandemic, several elite institutions have reinstated the requirement, suggesting that the trend away from admissions test requirements may be reversing.
To address this gap in the literature, we used the records of students who graduated from New York City public high schools and attended a four-year program at the City University of New York (CUNY) to examine whether, for historically disadvantaged students (defined as Black, Hispanic, or from low-income families), differential prediction of FGPA from test scores can be explained by high-school-level aggregate characteristics above and beyond student characteristics. In this study, the school-level characteristics we use are the aggregate characteristics of CUNY enrollees from a high school. We also compared the patterns of differential prediction exhibited by two different types of tests: the SAT, a college admissions test, and two New York state Regents high school exams required for high school graduation. While college admissions tests, such as the SAT and ACT, have traditionally been used to assess college readiness, there is interest in using state-mandated high school tests to assess whether students are “college and career ready.” Currently, Regents exam scores are one criterion used to determine admission to four-year degree programs in the CUNY system (City University of New York [CUNY], 2017). Moreover, it is important to ascertain whether differential prediction is common to achievement tests generally or attributable to specific characteristics of college admissions tests. Given differences in both content and preparation practices between the two types of assessments, the partitioning of differential prediction between the student and school levels may differ between the two tests.
Background
Within and Between School Effects
In the validity literature, researchers have usually measured differential prediction by fitting separate OLS regression models for each postsecondary institution, calculating average residuals for specific student subgroups, and then aggregating across participating institutions (e.g., Bridgeman et al., 2000; Mattern et al., 2008). For example, a positive mean residual would indicate that, on average, performance has been underpredicted for that group. If each subgroup had mean residuals of zero, then there would be no differential prediction. Typically, the predictor variables in the OLS model include college admissions test scores and HSGPA, separately or jointly, and the most common outcome used as a criterion is FGPA. While estimating models separately by postsecondary institution does allow for differences among them, these studies do not take into account differences among students’ high schools.
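To make this conventional procedure concrete, the following is a minimal sketch, not the exact code used in the studies cited; the column names (fgpa, sat, hsgpa, institution, race) are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

def subgroup_mean_residuals(df: pd.DataFrame) -> pd.Series:
    """Fit one OLS model per postsecondary institution, then average the
    residuals within race/ethnicity groups, pooling across institutions."""
    pieces = []
    for _, inst in df.groupby("institution"):
        fit = smf.ols("fgpa ~ sat + hsgpa", data=inst).fit()
        pieces.append(pd.DataFrame({"race": inst["race"], "resid": fit.resid}))
    # Positive subgroup means indicate underprediction; negative means,
    # overprediction. Zero means for every subgroup = no differential prediction.
    return pd.concat(pieces).groupby("race")["resid"].mean()
```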
Recognizing that high school quality is likely associated with student performance in college, Pike and Saupe (2002) compared three analytical approaches to predicting FGPA from test scores: single-level regression models, models that include fixed effects for high schools, and two-level mixed models. They found that models that accounted for high schools (i.e., fixed-effects or mixed models) were more accurate predictors of FGPA in the aggregate. They also found that including school characteristics in the two-level mixed models improved prediction. However, the authors did not include the means of student demographics in their mixed models and did not investigate whether school characteristics contribute to explaining differential prediction by race/ethnicity.
Koretz and Langi (2018) compared the relative magnitude of the associations between academic achievement measures and FGPA within and between high schools using two-level mixed models. Using data from CUNY freshmen, the authors found that when HSGPA and test scores were jointly used to predict FGPA, as they typically are in admissions, HSGPA was a stronger predictor within high schools than between, while test scores predicted more strongly between schools than within. Their results, which have been replicated by Allensworth and Clark (2020) using data on Chicago Public Schools students, suggest that including test scores in the prediction may adjust for differences in grading standards between high schools.
Thus, while a few studies have used multilevel models to identify school effects in the prediction of college performance, none have addressed questions related to differential prediction by race/ethnicity.
High School Tests and College Admissions Tests
While several prior studies have compared the overall predictive value of high school and college admissions tests (e.g., Cimetta et al., 2010; Coelen & Berger, 2006, cited in Cimetta et al., 2010; McGhee, 2003), only one study, to our knowledge, has compared differential prediction by race/ethnicity and income status between the two types of tests. Using data from New York and Kentucky, Koretz et al. (2016) found small differences between the two tests in the strength of aggregate prediction when each of the tests was used jointly with HSGPA to predict FGPA. There were only trivial differences in differential prediction between the high school and college admissions tests for students of different race/ethnicity and income groups. The authors, however, used only single-level ordinary least squares regression without school-level variables.
Even though Koretz et al. (2016) found that differential prediction was not sensitive to the choice of test when a simple student-level model was used, there are reasons to expect that differences may be found when differential prediction is partitioned into within- and between-school components. Differences in the content and difficulty of the exams may plausibly lead to differences in the partitioning of differential prediction between levels, and the content of the Regents exams, which are aligned to state standards, is less difficult than that of college admissions tests. The SAT is intended to be a more general measure of college preparation. Research has not yet indicated whether the difficulty of test content is associated with the magnitude of differential prediction.
The stakes associated with each set of exams and the locus of test preparation practices might also create differences in within- and between-school differential prediction between the two tests. Decisions to participate in college admissions test preparation are typically made by students and families, and higher-income and Asian students disproportionately participate in SAT test preparation (Buchmann et al., 2010; Domingue & Briggs, 2009; Mbekeani, 2023). In contrast, decisions about test preparation practices for high-stakes K–12 tests are typically made at the classroom and school level, and studies have shown that test preparation is, on average, more extreme in schools with more minoritized students (Diamond & Spillane, 2004; Jacob et al., 2004; Jennings & Sohn, 2014). Given these differences, it may be the case that student-level differential prediction is greater for college admissions tests, and school-level differential prediction is greater for high school tests.
In the present study, we examine the differential prediction of FGPA from two tests, the Regents exams and the SAT, for students who are Black and Hispanic and from low-income families enrolled in CUNY and who graduated from New York City public high schools. We hypothesized the following:
1. When FGPA is estimated with a two-level model, we expect patterns of student-level overprediction consistent in direction with the prior literature but different in magnitude. The magnitude of the overprediction will differ between the SAT and Regents tests.
2. We expect that there will be school-level overprediction of FGPA based on the proportion of students who are Black and Hispanic and from low-income families. We hypothesize that the between-school overprediction will be directionally consistent but differ in magnitude between the two tests.
Data
Sample
The data for this study were provided by CUNY and the New York State Education Department (NYSED). The data include one cohort of students (N = 11,482) who graduated from New York City (NYC) public high schools in 2011 and enrolled as freshmen in 2011 or 2012 in CUNY institutions that offer bachelor’s degrees. Approximately three-quarters of first-time freshmen at CUNY are from NYC public high schools (CUNY, 2012). The focus of this study is on students enrolled in four-year programs, the students for whom college admissions tests are relevant. Eleven CUNY institutions grant bachelor’s degrees: eight senior colleges, which offer only four-year degrees, and three comprehensive colleges, which offer both two-year and four-year degrees. 1 The data do not indicate which students in comprehensive colleges are in two-year programs, but as explained later, it is likely that most of the students in two-year programs at these institutions were dropped from our sample.
We imposed a number of restrictions to identify the relevant sample. First, we excluded students missing scores on the SAT or Regents exams or HSGPA. The most commonly missing variable was the SAT score (10% of the sample). From 21% to 33% of students at the comprehensive colleges were missing these scores, compared with 1% to 4% at the senior colleges. By excluding students missing SAT scores, we likely dropped students in two-year programs because SAT scores are not required for them. The share of students missing HSGPA was approximately 2%, and the shares of students missing Regents ELA and mathematics scores were 1% and 2%, respectively. We call the result the full sample; it included 9,971 students.
Approximately 20% of the sample (2,122 students) was missing race/ethnicity and low-income status information. These variables came from NYSED eighth-grade Regents files, and the records of those students could not be matched to records in the NYSED data. There are several possible reasons for this failure to match, the most likely of which is that these students did not attend public school in New York state in eighth grade. Given that this is a substantial number of students, we imputed missing values of race/ethnicity and income status using multiple imputation methods described in greater detail in the online Supplemental Materials. We tested the sensitivity of our results to this imputation and found that they were not sensitive to the imputation of missing data. We present the imputed results as our main results and include unimputed results in the online Supplemental Materials, Table S1. Finally, we dropped students whose racial or ethnic group included fewer than 50 students (<1% of the sample).
To create our analytic sample, we further restricted the sample to students in high schools with at least eight students who went on to a four-year program at CUNY, resulting in an analytic sample of 9,252 students. School sample sizes ranged from 8 to 317. We implemented this restriction to ensure sufficient degrees of freedom within high schools because we estimated as many as seven parameters within schools. 2 The full and analytic samples include 87% and 81% of the original 2011 cohort, respectively. As shown in Table 1, high schools in the analytic sample had higher mean FGPAs, SAT scores, and Regents scores than the full sample of high schools. However, there are few differences between the full and analytic samples in terms of student demographic characteristics.
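These restrictions can be summarized in a few lines of pandas; the sketch below is illustrative, and the column names (sat_math, high_school, etc.) are assumptions rather than the study's actual variable names.

```python
import pandas as pd

# df: one row per CUNY four-year enrollee from the 2011 NYC cohort (assumed).
score_cols = ["sat_math", "sat_cr", "regents_math", "regents_ela", "hsgpa"]
full_sample = df.dropna(subset=score_cols)      # drop missing scores/HSGPA

# Keep high schools contributing at least eight CUNY four-year enrollees.
school_n = full_sample.groupby("high_school")["high_school"].transform("size")
analytic_sample = full_sample[school_n >= 8]
```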
Table 1. Descriptive Student and School-Level Statistics for Full and Analytic Samples
Note. Standard deviations in parentheses. The full sample includes all students with nonmissing SAT and Regents scores. The descriptive demographic proportions are calculated among students with nonmissing values. The analytic sample includes students from high schools with eight or more observations. ELA = English language arts; FGPA = freshman grade point average; HSGPA = high school grade point average.
Measures
The outcome measure is FGPA at a four-year CUNY institution, which was on a 0 to 4 scale and weighted by the number of credits associated with the course. Non-credit-bearing remedial courses were not included in the CUNY FGPA measure. As with extant validity studies, we rely on institution-provided measures. In addition, for the purposes of assessing the validity of college admissions tests, only college-level courses that progress students toward a credential are relevant. The distribution of FGPA is non-normal, with a left skew and a spike at zero (see Figure 1). An FGPA equal to zero indicates a grade of F in all courses. Students with a zero FGPA appeared dissimilar to other students because the distribution of FGPAs has only a small number of cases just above zero. On average, students with FGPAs equal to zero had taken fewer courses (3.4) and attempted fewer credits (9.2) than other students in the analytic sample (9.5 courses and 31.1 credits). We excluded these students and tested the robustness of our results to this exclusion.

Figure 1. Distribution of CUNY weighted freshman grade point average
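As a small illustration of the credit-weighting described above (a sketch; the actual CUNY computation is institution-defined):

```python
def weighted_fgpa(grade_points, credits):
    """Credit-weighted GPA on a 0-4 scale: each course grade is weighted by
    its credits. Non-credit remedial courses are excluded from both sums."""
    return sum(g * c for g, c in zip(grade_points, credits)) / sum(credits)

# e.g., weighted_fgpa([4.0, 3.0, 2.0], [3, 4, 3]) -> 3.0
```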
Our predictors include HSGPA, composite SAT scores, and composite New York state Regents scores. The HSGPA variable was calculated by CUNY using only courses determined to be “college preparatory” and is on a scale of 50 to 100. CUNY uses this measure for admissions purposes. This measure of HSGPA differs from those used in other studies in ways that likely make it a stronger predictor of FGPA. First, the CUNY HSGPA was based on academic courses only, rather than on all the high school courses a student had taken, as is the case in other studies. Second, the HSGPA in this study came from students’ high school transcripts and was not self-reported, which likely reduces measurement error and possibly bias. We formed composite SAT and Regents scores using students’ highest scores, which were the only scores available in the data. 3 Students’ SAT scores are on the usual scale of 400 to 1600 (200 to 800 for each subject). The SAT composite was formed using the mathematics and critical reading (CR) scores. The Regents composite was formed using a Regents mathematics exam score and the English language arts (ELA) score. Regents scores range from 0 to 100. A score of 65 is considered passing and is typically treated as the required minimum score for admission to a four-year CUNY college, but a very small number of students in the analytic sample had scores below this value (0.5% of students on Regents ELA and 1% of students on Regents math). We standardized the academic achievement measures to facilitate comparison of their relative contribution to the prediction of FGPA. For the SAT and Regents composite scores, we standardized the mathematics and CR/ELA scores separately to equalize their weight in the composite and then standardized the sum.
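The composite construction can be written compactly; the sketch below assumes hypothetical column names for the subject scores.

```python
import pandas as pd

def standardized_composite(math: pd.Series, verbal: pd.Series) -> pd.Series:
    """Standardize each subject separately so the two carry equal weight,
    sum them, then standardize the sum, as described above."""
    z = lambda s: (s - s.mean()) / s.std()
    return z(z(math) + z(verbal))

# sat_z     = standardized_composite(df["sat_math"], df["sat_cr"])
# regents_z = standardized_composite(df["regents_math"], df["regents_ela"])
```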
While this cohort was in high school, there was a transition in Regents mathematics exams. The Regents Integrated Algebra (IA) exam was first administered in June 2008, and the Mathematics A exam it replaced was last administered in January 2009. Students in the 2011 cohort had the opportunity to take either exam, and some took both. The modal test for the 2011 cohort was the IA exam, which was taken by 73% of the analytic sample. The Regents mathematics score we used was a student’s IA exam score, if available, and the Mathematics A score if the student did not have an IA score. Koretz et al. (2016) found no difference in the predictive power of the two mathematics tests.
We included as predictors indicators of student race/ethnicity and whether a student came from a low-income family. There were four racial/ethnic groups (African American or Black, Asian, Hispanic, and white) with sufficient numbers to be included in the analyses. The low-income indicator was defined by NYSED and denotes whether a student participated in the free- and reduced-price school lunch program or another economic assistance program, including, for example, Temporary Assistance for Needy Families (TANF) or food stamps (University of the State of New York, 2011, p. 244).
The school-level predictors are the means of all student-level predictors taken over all CUNY students who had been enrolled in a given high school. Note that these were not means taken across all students enrolled in a given high school. For the purposes of this study, sample aggregates are appropriate for disentangling differential prediction between levels; whole-school means, even had they been available, would not allow us to disentangle the prediction. In any case, whole-school means were unavailable for the SAT, because not all students take the exam, and for HSGPA, because the data did not include the HSGPA of students not enrolled in CUNY. Because students who did not enroll in CUNY are likely different from those who did, we expect the analytic-sample means to differ from whole-school means.
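Constructing these sample aggregates amounts to taking group means over enrollees from each high school; a minimal sketch with illustrative variable names:

```python
# Sample aggregates: means over CUNY enrollees from each high school
# (not whole-school means). Variable names are assumptions.
level1_vars = ["hsgpa_z", "sat_z", "asian", "black", "hispanic"]
for v in level1_vars:
    df[f"sch_mean_{v}"] = df.groupby("high_school")[v].transform("mean")
```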
Methods
We first fitted a series of single-level OLS regression models to ascertain whether our data exhibited the typical patterns of differential prediction found in previous validity studies of the SAT. We estimated three single-level OLS regression models predicting FGPA from (1) HSGPA and demographics; (2) HSGPA, SAT scores, and demographics; and (3) HSGPA, Regents scores, and demographics. 4
The OLS models were of the following form:

$$\text{FGPA}_{i} = \beta_{0} + \beta_{1}\,\text{HSGPA}_{i} + \beta_{2}\,\text{TEST}_{i} + \beta_{3}\,\text{Asian}_{i} + \beta_{4}\,\text{Black}_{i} + \beta_{5}\,\text{Hispanic}_{i} + \beta_{6}\,\text{LowInc}_{i} + \varepsilon_{i} \quad (1)$$

The outcome, $\text{FGPA}_{i}$, is the credit-weighted freshman grade point average of student $i$; $\text{TEST}_{i}$ is the standardized SAT or Regents composite (omitted in Model 1); the race/ethnicity and low-income indicators are effect coded; and all models include fixed effects for CUNY institutions.
This approach is substantively similar to the conventional approach of estimating differential prediction by calculating mean residuals from an overall regression model without demographic variables, and the two approaches yield similar results. 5 Positive coefficients in our models correspond to underprediction (i.e., positive residuals) in a traditional model without these variables, and negative coefficients indicate overprediction (i.e., negative residuals). If there were no differential prediction, we would find coefficients of zero on the effect-coded demographic measures. We used effect coding to identify differential prediction because the approach is more easily applied to two-level models, where the interpretation of residuals becomes more difficult.
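A minimal sketch of this single-level specification follows, using statsmodels with patsy's Sum (effect) coding; the variable names are illustrative, and Model 1 simply omits the test score term.

```python
import statsmodels.formula.api as smf

# Sketch of Model 2 (HSGPA + SAT + demographics). Sum (effect) coding makes
# each race/ethnicity coefficient a deviation from the grand mean, so a
# negative coefficient corresponds to overprediction.
model2 = smf.ols(
    "fgpa ~ hsgpa_z + sat_z + C(race, Sum) + C(low_income, Sum)"
    " + C(cuny_college)",   # fixed effects for CUNY institutions
    data=df,
).fit()
print(model2.params.filter(like="Sum"))  # effect-coded demographic terms
```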
As our primary models, we estimated a series of two-level mixed models to examine the differential prediction of FGPA by both individual and high school demographic characteristics. The level-1 model predicts the FGPA of student $i$ in high school $j$:

$$\text{FGPA}_{ij} = \beta_{0j} + \beta_{1j}\,\text{HSGPA}_{ij} + \beta_{2j}\,\text{TEST}_{ij} + \beta_{3j}\,\text{Asian}_{ij} + \beta_{4j}\,\text{Black}_{ij} + \beta_{5j}\,\text{Hispanic}_{ij} + r_{ij} \quad (2)$$
where all variables are the same as in the OLS models (1) but measured within the school. We grand-mean center all the level-1 predictors in order to obtain level-2 estimates of context effects, which we describe in further detail later. The two-level models do not include an indicator of low-income status, as the coefficient on this variable was close to zero and nonsignificant. 6 The equation for the level-2 model is:

$$\beta_{0j} = \gamma_{00} + \gamma_{01}\,\overline{\text{HSGPA}}_{j} + \gamma_{02}\,\overline{\text{TEST}}_{j} + \gamma_{03}\,\overline{\text{Asian}}_{j} + \gamma_{04}\,\overline{\text{Black}}_{j} + \gamma_{05}\,\overline{\text{Hispanic}}_{j} + u_{0j}, \qquad \beta_{pj} = \gamma_{p0} \text{ for } p = 1, \ldots, 5 \quad (3)$$
where the predictors are the analytic-sample means of the level-1 predictors. The two-level models are random-intercept, fixed-slope models; the school-mean FGPA is allowed to vary between schools, but the within-school slope coefficients are fixed across high schools. The combined model is then:

$$\text{FGPA}_{ij} = \gamma_{00} + \sum_{p=1}^{5}\gamma_{p0}X_{pij} + \sum_{q=1}^{5}\gamma_{0q}\overline{X}_{qj} + u_{0j} + r_{ij} \quad (4)$$

where $X_{1ij}, \ldots, X_{5ij}$ are the level-1 predictors in equation (2) and $\overline{X}_{qj}$ are their school means.
Because these are random-intercept fixed-slope models, the inclusion of high school means at level 2 results in level-1 parameters that are estimates of the within-school, between-student effects (Enders & Tofighi, 2007; Kreft et al., 1995). These level-1 effects are nearly identical to what would be obtained from using a high school fixed-effects specification that captures only the within-school variation (Allison, 2009). For a comparison of within-school estimates from random-effects and fixed-effects specifications, see Table S2.
Given that we grand-mean center the level-1 variables, the level-2 parameters are context effects; that is, they represent the association between the aggregate school-level variables and FGPA after controlling for the individual-level association (Enders & Tofighi, 2007; Raudenbush & Bryk, 2002). We note that we use the term “effect” according to the terminology of multilevel models, without any causal implication (see, e.g., Raudenbush & Bryk, 2002). When a context effect is present, there is an expected difference in outcome for two students with identical individual values but who attended high schools that differed in their mean value on a given predictor (Raudenbush & Bryk, 2002). Therefore, the coefficients on the school-mean demographic variables capture between-school differential prediction, net of the corresponding within-school associations.
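A sketch of this two-level specification follows, assuming grand-mean-centered level-1 variables (suffix _c) and the school means built earlier; it is an illustration under those naming assumptions, not the authors' exact estimation code.

```python
import statsmodels.formula.api as smf

# Random-intercept, fixed-slope model (cf. Table 4, Models 4-6). With
# grand-mean-centered level-1 predictors, the coefficients on the school
# means are context effects.
mixed = smf.mixedlm(
    "fgpa ~ hsgpa_c + sat_c + asian_c + black_c + hispanic_c"
    " + sch_mean_hsgpa_z + sch_mean_sat_z"
    " + sch_mean_asian + sch_mean_black + sch_mean_hispanic"
    " + C(cuny_college)",
    data=df,
    groups=df["high_school"],   # random intercept for each high school
).fit()
print(mixed.summary())
```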
To examine differences in the magnitude of coefficients between models with different test scores, we fitted a fully interacted OLS regression model using a dummy variable to indicate which test was used. 7 We clustered the standard errors in these models by student and school. The parameters of interest were the student- and school-level interactions between the demographic indicators and the indicator of the test used.
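One way to implement such a fully interacted comparison is to stack the data with an indicator for which test a row uses and cluster the standard errors two ways; the sketch below uses illustrative variable names and is one plausible construction, not necessarily the authors' exact one.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Stack two copies of the data: one carries the SAT composite and one the
# Regents composite as the generic "score", flagged by a dummy.
stacked = pd.concat(
    [df.assign(score=df["sat_z"], regents=0),
     df.assign(score=df["regents_z"], regents=1)],
    ignore_index=True,
)
# School mean of whichever composite is active on each row.
stacked["sch_mean_score"] = (
    stacked.groupby(["high_school", "regents"])["score"].transform("mean")
)
# Two-way clustering: integer codes for student and high school clusters.
two_way = np.column_stack([pd.factorize(stacked["student_id"])[0],
                           pd.factorize(stacked["high_school"])[0]])
interacted = smf.ols(
    "fgpa ~ regents * (hsgpa_c + score + asian_c + black_c + hispanic_c"
    " + sch_mean_hsgpa_z + sch_mean_score + sch_mean_asian"
    " + sch_mean_black + sch_mean_hispanic) + C(cuny_college)",
    data=stacked,
).fit(cov_type="cluster", cov_kwds={"groups": two_way})
# The regents:* terms test whether coefficients differ between the tests.
```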
We fitted several models with cross-level interactions to examine whether school demographic or achievement characteristics moderated the within-school relationships between students’ own race and FGPA. We conducted a sensitivity analysis to assess the impact of excluding students with FGPA equal to zero.
Results
Descriptive Results
The analytic sample comprised 9,252 students in 216 high schools. In Table 1, we present the average characteristics of the sample. As noted, the average academic and demographic characteristics of the full and analytic samples are nearly identical. The mean SAT scores in the analytic sample were 510 in mathematics and 466 in CR, which were similar to the national average on the SAT for the high school graduating class of 2011 in mathematics (mean = 514) and slightly below the national average in CR (mean = 497) (College Board, 2011). The average Regents scores were 79.6 in math and 82.6 in ELA. The average HSGPA was 82.4. In the analytic sample, 33% of students were Asian, 18% Black, 30% Hispanic, and 19% white. Four-fifths of students were from households with low incomes.
The FGPA, HSGPA, and Regents scores of students from low-income families in the analytic sample were similar to those of other students (Table 2). There were, however, appreciable differences in SAT scores, with non-low-income students performing 30 points (or about 0.28 standard deviations) better in math and 39 points (0.41 standard deviations) better in CR than low-income students (math: mean = 504; CR: mean = 459). 8 These differences suggest there will be little differential prediction based on income status in models using Regents scores, although a difference is more plausible in models using SAT scores.
Table 2. Descriptive Statistics for Student Subgroups in Analytic Sample (N = 9,252)
Note. The table includes all students in the analytic sample. We use the multiply imputed data to estimate the average academic characteristics across each demographic group. FGPA = freshman grade point average; ELA = English language arts; HSGPA = high school grade point average.
The mean academic performance of Black and Hispanic students was lower than that of Asian and white students. For both Black and Hispanic students, the largest difference from the mean of the analytic sample was in SAT mathematics, with Black and Hispanic students scoring 60 and 39 points below the grand mean, respectively (Table 2).
Single-Level Results
Our single-level models showed patterns of differential prediction based on race/ethnicity similar to those exhibited in other SAT validity studies. On average across the three models presented in Table 3, the FGPA of Black and Hispanic students was slightly overpredicted, whereas that of Asian students was slightly underpredicted. In this sample, the FGPA of white students was also underpredicted, for example, by approximately 0.085 grade points in Model 1. 9 This finding differs from that of some previous SAT validity studies in which white students’ performance was accurately predicted or slightly underpredicted (e.g., Mattern et al., 2008). This difference is due in part to the fact that the racial groups in this sample were more equally balanced than in other SAT validity studies. For students from low-income families, the single-level models revealed little evidence of differential prediction, with coefficients close to zero and statistically insignificant, regardless of which test score was included.
Table 3. Ordinary Least Squares Regression Results Predicting Freshman Grade Point Average
Note. Negative (positive) coefficients indicate overprediction (underprediction). Academic predictor variables are standardized. Demographic variables are effect coded. Models include fixed effects for CUNY institutions. R2 values are obtained using Fisher's r-to-z transformation for the multiply imputed data. CUNY = City University of New York; HSGPA = high school grade point average. **p<0.01; ***p<0.001.
These patterns of differential prediction appeared both when HSGPA was the sole academic predictor and when HSGPA was combined with test scores (Models 2 and 3 in Table 3); however, the differential prediction by race/ethnicity was very slightly smaller when test scores were included in the model. For example, the FGPA of Black students was overpredicted by 0.07 grade points in the model that excluded test scores but by 0.05 points in both models that included test scores. The results were largely similar for both tests, although overprediction for Hispanic students was slightly larger (by 0.02 grade points) when Regents scores were used (Model 3). The patterns of differential prediction were similar in the test-only models (see online Supplemental Materials, Table S3), with some evidence of slightly larger differential prediction for Asian and Black students.
Two-Level Results
There was little differential prediction based on individual race/ethnicity among students within schools after accounting for the school-level aggregate characteristics of CUNY enrollees, contrary to our first hypothesis. Accounting for the average academic performance of CUNY enrollees (Table 4, Models 1–3) eliminated most of the student-level differential prediction. Some within-school overprediction, approximately 0.04 grade points, remained for Hispanic students, but within-school, between-student differential prediction for Asian and Black students was approximately zero and not statistically significant. Adding the average demographic measures to the model eliminated all of the within-school differential prediction: in Models 4–6 in Table 4, the coefficients on the student-level demographic variables are close to zero and not statistically significant.
Table 4. Two-Level Random-Intercept Models Predicting Freshman Grade Point Average
Note. Models include fixed effects for CUNY institutions. CUNY = City University of New York; HSGPA = high school grade point average; ICC = intra-class correlation. *p<0.05; **p<0.01; ***p<0.001.
The patterns of differential prediction associated with individual race/ethnicity within schools were similar across the two types of tests. In Table 5, we present the fully interacted model to formally test differences in differential prediction across the two tests. The coefficients on the three level-1 interaction terms in the fully interacted model were trivial and not statistically different from zero. This contrasts with our expectation of differences between the two tests in within-school, between-student differential prediction for Black and Hispanic students.
Table 5. Fully Interacted OLS Model Testing Differences in SAT and Regents Coefficients
Note. Academic predictor coefficients and interactions not shown. Models include fixed effects for CUNY institutions. Standard errors, in parentheses, are clustered by student and by high school. CUNY = City University of New York. **p<0.01; ***p<0.001.
We found that differential prediction was associated with the aggregate demographic characteristics of enrollees from a given high school, controlling for individual demographics and achievement as well as enrollee mean achievement (Table 4, Models 4–6). The larger the proportion of Black or Hispanic students enrolled in CUNY from a given high school, the greater the overprediction of FGPA; conversely, the higher the proportion of Asian enrollees, the greater the underprediction of FGPA. However, the differential prediction associated with school-level enrollee demographics varied markedly across models. As in the case of the student-level OLS estimates, between-school differential prediction was largest when controlling for mean HSGPA but not for mean test scores (Model 4), with coefficients of −0.5 and −0.6 on the proportions of Black and Hispanic enrollees, respectively. Unlike the student-level OLS estimates, however, when mean scores were included, the choice of test mattered. Overprediction associated with the proportion of Black or Hispanic enrollees was smaller (approximately −0.15) and not statistically significant when SAT scores were used. It was considerably larger, with coefficients of −0.3 to −0.4, and statistically significant when mean Regents scores were included. In that case, overprediction associated with having attended a high school where the share of Black or Hispanic CUNY-enrolled students is about one standard deviation, or 30 percentage points, higher was approximately 0.09 to 0.12 grade points (a coefficient of −0.3 to −0.4 multiplied by a 0.30 difference in proportion).
The lack of within-school main effects left open the possibility that there was within-school differential prediction that varied with school characteristics. However, we found no evidence of this. There were no significant cross-level interactions between individual race/ethnicity and school-level aggregate achievement or demographics.
Finally, it is worth noting that we found substantial context effects for the achievement measures. Net of individual achievement, higher school-mean test scores were associated with higher predicted FGPAs, on average. This indicates that considering only the within-school associations would underestimate the relationship between test scores and FGPA by approximately 0.2 for both SAT and Regents scores. In contrast, net of individual achievement, higher mean HSGPAs were associated with lower predicted FGPAs. Students in high schools with mean HSGPAs one unit higher had predicted FGPAs approximately 0.15 standardized grade points lower, holding all else constant (Table 4). This negative context effect of HSGPA indicates that if one only considered the variation among students within schools, one would overestimate the relationship between HSGPA and FGPA. Including school-mean enrollee demographics in the model did not substantially change these context effects.
Discussion
In this study, we examined whether extant validity studies misattribute differential prediction of FGPA from test scores and HSGPA to student characteristics rather than to the aggregate characteristics of the enrolled students from individual high schools. To do so, we respecified conventional validity models as two-level models that accounted for the nesting of students within schools and included the aggregate characteristics of all CUNY-enrolled students from a given high school. We hypothesized that there would be differential prediction for Black and Hispanic students both within and between schools for the two tests we examined. In addition, in response to the widespread debate about the comparative value of admissions tests and states’ own tests, we replicated our analysis using New York state Regents scores instead of SAT scores. We anticipated differences between the two tests in the partitioning of differential prediction between levels.
In contrast to our first hypothesis, we did not find evidence of differential prediction within schools for either test once we accounted for both the aggregate academic performance and the demographic characteristics of CUNY-enrolled students from a high school. Some within-school differential prediction for Hispanic students remained across both tests when only the aggregate academic performance measures were included in the two-level model.
Consistent with our second hypothesis, we found differential prediction between schools. Our results indicate that differential prediction was associated with the share of CUNY-enrolled Black and Hispanic students from a high school after controlling for student academic achievement and race/ethnicity and school-mean achievement. The differential prediction associated with individual race/ethnicity evident in our single-level models, which was consistent with prior research, was shown to reflect only school-level differential prediction in the two-level models.
Specifically, we found that the proportion of Black and Hispanic students enrolled in CUNY from a high school was associated with the overprediction of FGPA. The magnitude of the differential prediction associated with school demographics varied depending on which school mean academic measures were included, and in the case of models with SAT scores, the differential prediction was smaller in magnitude and not significant.
We hypothesized that the partitioning of differential prediction at the student and school levels for historically disadvantaged students would differ between the SAT and the Regents exams due to differences between the two exams in content, format, and stakes. While we found no within-school between-student differences between the two tests, we did find greater between-school differential prediction of FGPA for schools with higher proportions of Black and Hispanic enrollees when the model included Regents test scores rather than SAT scores.
While we did not have specific hypotheses related to context effects for the achievement measures, we did find positive context effects for both SAT and Regents exams and negative context effects for HSGPA (when average test scores were included in the model). These findings are consistent with prior studies (Allensworth & Clark, 2020; Koretz & Langi, 2018).
Our findings reinforce the implications of Koretz and Langi (2018), who partitioned the overall predictive power of SAT and Regents tests into within- and between-school components but did not do the same for differential prediction. Together, these studies suggest that both the useful predictive power of the tests and the small biases shown by differential prediction are largely attributable to the aggregate characteristics of applicants admitted from high schools, not to the characteristics of individual students within schools.
Although an increasing number of colleges have stopped using admissions tests or have made them optional, these findings remain important for policy and practice. First, many colleges still use admissions tests for admissions decisions, scholarship decisions, or placement decisions, and a correct understanding of the correlates of scores is important for their appropriate use. Second, these findings weaken common arguments against continued use of admissions tests. The reliance on single-level models led many critics to argue that the increment in student-level predictive power is not large enough to justify using admissions tests, but this study and those by Allensworth and Clark (2020) and Koretz and Langi (2018) show that tests serve to offset predictive differences among high schools. These may reflect between-school differences in academic or grading standards (Koretz & Langi, 2018). Third, as higher education institutions consider increasing their reliance on high school tests for assessing college readiness, our findings suggest that high school tests display more extreme patterns of differential prediction at the school level than college admissions tests.
Our motivation was to evaluate the potential misattribution of differential prediction by the single-level models used in conventional validity studies—that is, to address possible misinterpretation of the data received by postsecondary institutions. Our purpose was not to explore possible causal effects of school characteristics underlying our findings. Moreover, our data—like the data in nearly all validity studies to date—are not well suited to evaluating potential causal effects of school characteristics because they include only students enrolled in the postsecondary institution. Nonetheless, it is worthwhile to note possible mechanisms that might be explored in future research, in particular in studies that include data on the entire student population of high schools.
Although we have data only from enrolled students, these findings raise the question of what specific school characteristics or processes might explain the differential college performance associated with school proportion of historically disadvantaged and minoritized students. Schools with higher proportions of minoritized students and students growing up in poverty are ranked lower on many measures of school quality other than test scores, including teacher quality, access to advanced coursework, and support for applying to college (Clotfelter et al., 2005; Goldhaber et al., 2015; Klopfenstein, 2004; Lankford et al., 2002; Roderick et al., 2008). The effects we found may be the result of omitted variables related to school resources such as these.
Alternatively, the between-school differential prediction may be capturing omitted differences that are correlated with high school characteristics, such as peer effects. For example, students in under-resourced schools may have social networks with fewer peers who have the financial and social supports that help them attend college. Finally, it is possible that the effects we find are the result of unmeasured differences at other levels (e.g., neighborhoods).
The difference in differential prediction between the SAT and Regents tests at the school level may be the result of content differences between the two exams or differences in test preparation practices. Regents exams are more likely to be the focus of coaching efforts on a school-wide basis than the SAT, for which coaching is largely an individual undertaking. Excessive test preparation and coaching can lead to score inflation, which may in turn produce greater differential prediction in models that include Regents exam scores if students in high-minority schools experience more test preparation than their peers in schools with similar achievement levels but smaller proportions of minority students. While coaching is one possible explanation for this variation in prediction, it is also possible that the difference in differential prediction of FGPA is the result of other features of the two exams.
We note several limitations of the present study. This study included only students in one university system in one metropolitan area; therefore, the results may not be generalizable to other urban contexts and school systems. Future work should examine whether the results presented here can be replicated in other contexts that include different student populations, different postsecondary systems, and different tests.
Contrary to our expectations, we found no differential prediction based on low-income status either within or between schools, and this null finding may also reflect a limitation of the data. Characteristics of the sample may have made it particularly poorly suited to examining differential prediction by family income. Differences in average performance between low-income and non-low-income students in the sample were small. This may be a result of how students select into CUNY institutions, which we discuss in greater detail later. Furthermore, a finer-grained measure of income may have better captured differential prediction by income in this sample.
An additional set of limitations pertains to data availability. A substantial proportion of the sample was missing demographic data. While our results were robust to the imputation of these missing data, future research should replicate the present analyses with more complete data. The present analyses used school-level aggregate variables derived from the analytic sample rather than from the entire school population because the latter were unavailable. While this is generally the case in validity studies, which are almost always based on analysis of data about enrolled students, it is not ideal for addressing some substantive questions. It is not obvious, however, in which direction these values differ or whether the results would have been appreciably different if we had used whole-school means.
An important avenue for future work would be to address school effects directly by replicating these analyses with whole-school data and additional measures of school characteristics. Such work would be less applicable than our findings to the interpretation of application data by postsecondary institutions, but it would be more valuable for further investigating inequities in schooling.
One final limitation of the design of this study has to do with student selection. The processes by which students self-select into postsecondary institutions may vary by student subgroup. It is possible that the mean differences in unmeasured determinants of FGPA between demographics groups are smaller among students who attend CUNY than they are in the population of NYC public school students or in a broader set of postsecondary institutions. If so, this might contribute to the minimal within-school, between-student differential prediction that we observe. Moreover, the process by which students or student subgroups select into CUNY may induce associations between observed and unobserved variables that differ from the correlations found among the broader population of students. As noted, replication of these analyses with a broader set of institutions would shed light on the generalizability of the results presented here.
Despite these limitations, the findings of this study show that differential prediction studies that do not account for high schools omit an important level of analysis and misattribute differential prediction to student characteristics. The differential prediction of racially minoritized students is due to their disproportionate attendance at high schools in which enrolled students, regardless of race, perform worse in college than predicted.
Supplemental Material
sj-docx-1-ero-10.1177_23328584241245088: Supplemental material for “Differential Prediction for Disadvantaged Students and Schools: The Role of High School Characteristics” by Preeya P. Mbekeani and Daniel Koretz, AERA Open.
Acknowledgements
The authors thank the City University of New York and the New York State Education Department for the data used in this study and Tom Kane and Sasha Killewald for comments on earlier versions of this article. The opinions expressed are those of the authors and do not represent the views of the Institute, the U.S. Department of Education, the City University of New York, the New York State Education Department, the Inequality and Social Policy program, or the Harvard Graduate School of Education.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A110420 to the President and Fellows of Harvard College and by the Inequality and Social Policy program at the Harvard Kennedy School and the Harvard Graduate School of Education’s Dean’s Summer Fellowship.
Footnotes

1. In some publicly available information, both senior and comprehensive colleges are referred to as senior colleges (see, e.g., CUNY, 2012).
2. Small school counts should not bias the results of our two-level models because empirical Bayes or “shrunken” estimates of school effects, which we use, give progressively less weight to the data from schools as within-school counts decrease (Raudenbush & Bryk, 2002). Per McNeish (2017), our sample sizes are appropriate for the models we estimate and are not at risk of biased parameter estimates.
3. Although one might expect subject-specific scores to predict more strongly than composite scores, previous studies found that the strength of prediction is similar in both single-level and two-level models (Koretz & Langi, 2018; Koretz et al., 2016).
4. Results of additional models with scores and demographics but not HSGPA are included in the online Supplemental Materials, Table S3.
6. Results available from the authors upon request.
9. The differential prediction for white students can be calculated from the coefficients on Asian, Black, and Hispanic: because the effect-coded indicators are constrained to sum to zero, the implied white coefficient is $-(\hat{\beta}_{\text{Asian}} + \hat{\beta}_{\text{Black}} + \hat{\beta}_{\text{Hispanic}})$.
Authors
PREEYA P. MBEKEANI is a researcher at the American Institutes for Research, 10 S. Riverside Plaza, Suite 600, Chicago, IL 60606.

DANIEL KORETZ is Henry Lee Shattuck Research Professor of Education at the Harvard Graduate School of Education, Gutman Library, 6 Appian Way, Cambridge, MA 02138.