Abstract
Objectives:
Black, Indigenous, and People of Color have borne a disproportionate share of COVID-19 cases in the United States. However, few studies have documented the completeness of race and ethnicity reporting in national COVID-19 surveillance data. The objective of this study was to describe the completeness of race and ethnicity ascertainment in person-level data received by the Centers for Disease Control and Prevention (CDC) through national COVID-19 case surveillance.
Methods:
We compared COVID-19 cases with “complete” (ie, per Office of Management and Budget 1997 revised criteria) data on race and ethnicity from CDC person-level surveillance data with CDC-reported aggregate counts of COVID-19 from April 5, 2020, through December 1, 2021, in aggregate and by state.
Results:
National person-level COVID-19 case surveillance data received by CDC during the study period included 18 881 379 COVID-19 cases with complete ascertainment of race and ethnicity, representing 39.4% of all cases reported to CDC in aggregate (N = 47 898 497). Five states (Georgia, Hawaii, Nebraska, New Jersey, and West Virginia) did not report any COVID-19 person-level cases with multiple racial identities to CDC.
Conclusion:
Our findings highlight a high degree of missing data on race and ethnicity in national COVID-19 case surveillance, enhancing our understanding of current challenges in using these data to understand the impact of COVID-19 on Black, Indigenous, and People of Color. Streamlining surveillance processes to decrease reporting burden and align reporting requirements with an Office of Management and Budget–compliant collection of data on race and ethnicity would improve the completeness of data on race and ethnicity for national COVID-19 case surveillance.
To date, several studies have examined the unequal burden of disease caused by COVID-19 on Black, Indigenous, and People of Color,1-4 but little research has focused on the completeness of reporting on race and ethnicity in COVID-19 data.5-7 Several studies on racial and ethnic disparities in COVID-19 incidence and prevalence note that data on race and ethnicity are incomplete at both the state and national levels.5-9 The literature pays little attention to the tracking and completeness of case, hospitalization, and mortality data for Hispanic people and people of multiple races, and it lacks a critical review of the completeness of Centers for Disease Control and Prevention (CDC) data on race and ethnicity for cases, hospitalizations, and deaths.
In 1997, the Office of Management and Budget (OMB) revised its standards on race and ethnicity to include the 5 racial categories we use today—American Indian/Alaska Native (AI/AN), Asian, Native Hawaiian/Pacific Islander, Black, and White—with the 2 ethnic categories being Hispanic origin or non-Hispanic origin. 10 OMB criteria also hold that data on race and ethnicity should be collected separately and that individuals should be allowed to select more than 1 race when self-identifying their racial identity or identities. For the remainder of this article, we refer to racial and ethnic data that include these 5 racial categories, collect data on race and ethnicity separately, allow for individuals to be represented in data with more than 1 racial identity, and collect ethnicity as “complete.”
While this OMB directive applies to federal agencies reporting on demographic characteristics, the transmission of state surveillance data on COVID-19 to CDC is voluntary. Furthermore, while the Coronavirus Aid, Relief, and Economic Security (CARES) Act Section 18115 11 requires data on race and ethnicity to be recorded by laboratories when they report on COVID-19 test results, there is no enforcement mechanism. As a result, broad variation exists by state as to which racial and ethnic categories are recorded and how they are reported. 12
These variations in the recording and reporting of race and ethnicity directly decrease the ability of local and national governments to provide accurate surveillance for the most racially and ethnically diverse communities. Per 2020 postcensal estimates, 39.4% of AI/AN people identify as multiracial, and 37.6% of AI/AN people identify as being of Hispanic origin. 13 The collection of only 1 racial identity or the collection of only race or only ethnicity in COVID-19 data likely disproportionately underestimates the incidence of COVID-19 in AI/AN communities relative to more homogeneous demographic communities, such as White people (3.3% multiracial, 21.3% Hispanic). 13 Without an appropriate understanding of the quality of data used in the surveillance of these communities, interventions or funding aimed at addressing the disproportionate impacts of COVID-19 may be misguided.
This study describes several elements of data quality on race and ethnicity in person-level, national COVID-19 case surveillance data. First, we wanted to understand the proportion of all COVID-19 cases that were sent to the person-level COVID-19 case surveillance dataset with complete information on race and ethnicity. Second, we examined the percentage of COVID-19 cases with complete data on race and ethnicity compared with the CDC aggregate COVID-19 case counts to better understand the generalizability of the person-level COVID-19 dataset to all known COVID-19 cases. 14 CDC aggregate COVID-19 case counts are collected from jurisdictions and provide the most up-to-date numbers on cases and deaths but do not contain specific information on demographic characteristics of individual cases, which we contrast in this study with case numbers from person-level case surveillance with detailed demographic information. Third, we examined the proportion of all COVID-19 cases that were identified as having multiple racial identities as compared with a state’s underlying percentage of its population that identified as having multiple racial identities. Finally, we examined state-by-state discrepancies in the collection of data on race and ethnicity.
Methods
For this analysis, we assessed case surveillance data from CDC’s national person-level COVID-19 case surveillance data from April 5, 2020, through December 1, 2021, from all 50 states. Jurisdictions voluntarily submit deidentified, standardized information electronically for individual COVID-19 cases to CDC. This information contains fields on age, race and ethnicity, sex, underlying comorbidities, and health outcomes. 15 Data on race and ethnicity could be collected in a method compliant with the 1997 OMB revised guidelines with the dataset including 1 field for each of the 5 OMB racial groups (AI/AN, Asian, Black, Native Hawaiian/Pacific Islander, and White), as well as a field for other race, other specified race, and race unknown. Individuals could be identified with 1 or more racial identities. Data on race were kept separate from data on ethnicity. Ethnicity was captured as either Hispanic/Latino or non-Hispanic/Latino. More information on the standardization and collection of these data is available elsewhere. 16 We compared these data with CDC national aggregate COVID-19 case surveillance data from April 5, 2020, through December 1, 2021, from all 50 states.
Both datasets obtained from CDC by the Urban Indian Health Institute were downloaded initially via the DCIPHER (Data Collation and Integration for Public Health Emergency Response) outbreak response database and subsequently from the CDC protect outbreak response database. Analysis was restricted to confirmed COVID-19 cases with an identified county of residence and a complete CDC case notification date, which indicates the first day CDC was notified of the case. Analysis was further restricted to cases reported on or after April 5, 2020, to reflect the period when the Council of State and Territorial Epidemiologists published the first interim case surveillance definition, making COVID-19 a nationally notifiable condition. 17
We analyzed the completeness of the collection of data on racial and ethnic identity via several steps. First, we determined the percentage of COVID-19 cases in the CDC database with complete ascertainment of race and ethnicity. For the purposes of this article, complete ascertainment of race and ethnicity is met only when a COVID-19 case was assigned to 1 or more of the 5 OMB race categories or had any response included in the “other specified race” field and ethnicity was identified as either non-Hispanic/Latino or Hispanic/Latino. Second, we used a Pearson correlation coefficient to determine whether the daily number of person-level COVID-19 cases reported to CDC with complete race and ethnicity, divided by the number of aggregate COVID-19 cases on that same day, was significantly associated with the number of days since April 5, 2020. Third, we divided the number of COVID-19 cases with complete race and ethnicity in the CDC person-level database by the number of COVID-19 cases reported by aggregate counts of COVID-19 cases. Fourth, to examine the representation of individuals with multiple racial identities, we calculated, among COVID-19 cases with at least 1 identified racial identity, the percentage that indicated 2 or more racial identities. We then compared this percentage with the percentage of individuals in a state identified as having 2 or more racial identities per the 2020 postcensal estimates. To identify states whose proportions of multiracial individuals in COVID-19 case surveillance differed from state population estimates, we used an exact binomial test with a Bonferroni adjustment. 18 Fifth, to examine the collection of both race and ethnicity, we examined the percentage of individuals who had both race and ethnicity assessed among those with either race or ethnicity assessed. We performed these analyses for the aggregate of all 50 states and individually by state. Findings were considered significant at α = .05.
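The completeness rule defined above can be sketched in code. The authors conducted their analyses in R and did not publish code; the following is a minimal Python illustration only, and the record layout (the CaseRecord class, its field names, and the category labels) is hypothetical rather than the actual CDC data schema.

```python
from dataclasses import dataclass, field

# The 5 race categories in the 1997 OMB revised standards (labels are illustrative).
OMB_RACES = frozenset({"AI/AN", "Asian", "Black", "NH/PI", "White"})
VALID_ETHNICITIES = frozenset({"Hispanic/Latino", "Non-Hispanic/Latino"})


@dataclass
class CaseRecord:
    """Hypothetical person-level record; not the real CDC field layout."""
    races: set = field(default_factory=set)  # OMB race fields marked "yes"
    other_race_specified: str = ""           # free-text "other specified race"
    ethnicity: str = ""                      # empty or "Unknown" when missing


def race_complete(rec):
    """True if 1 or more OMB race categories is set or another race is specified."""
    return bool(rec.races & OMB_RACES) or bool(rec.other_race_specified.strip())


def ethnicity_complete(rec):
    """True if ethnicity is recorded as Hispanic/Latino or non-Hispanic/Latino."""
    return rec.ethnicity in VALID_ETHNICITIES


def fully_complete(rec):
    """Complete ascertainment requires both race and ethnicity."""
    return race_complete(rec) and ethnicity_complete(rec)


def is_multiracial(rec):
    """True if 2 or more OMB race categories are recorded."""
    return len(rec.races & OMB_RACES) >= 2
```

Under this sketch, a record with races {"AI/AN", "White"} and ethnicity "Hispanic/Latino" counts as both complete and multiracial, whereas a record with a race but an unknown ethnicity counts toward complete race only.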
We used R version 4.0.2 (R Foundation for Statistical Computing) to conduct all analyses. We conducted analyses as part of public health surveillance activities, and the Seattle Indian Health Board determined this project to be exempt from human subjects approval.
Results
As of December 1, 2021, a total of 35 579 801 laboratory-confirmed COVID-19 cases were in CDC’s COVID-19 person-level case surveillance dataset, of which 33 815 355 (95.0%) had complete information on county of residence and CDC notification date.
Of these 33 815 355 COVID-19 cases, 22 347 629 (66.1%) had complete ascertainment of race, 23 562 878 (69.7%) had complete ascertainment of ethnicity, and 18 881 379 (55.8%) had complete ascertainment of both race and ethnicity (Table 1). On the same date, there were a total of 47 898 497 COVID-19 cases for all 50 states per aggregate counts. Therefore, 70.6% of COVID-19 cases are included in the CDC person-level case surveillance dataset. Proportional complete ascertainment of both race and ethnicity in CDC case surveillance relative to the aggregate number of COVID-19 cases was lowest in periods when the number of COVID-19 cases increased (eFigure 1 in Supplemental Material). We observed a significant negative correlation between the percentage of aggregate COVID-19 cases reported to the person-level case surveillance dataset with complete race and ethnicity information and days since April 5, 2020 (R = −0.18, P < .001) (Figure 1). We observed a significant positive correlation between the percentage of person-level COVID-19 cases with complete race and ethnicity information and days since April 5, 2020 (R = 0.71, P < .001) (eFigure 2 in Supplemental Material).
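The percentages in this paragraph follow directly from the reported counts. As a check, the short Python sketch below recomputes them; the variable names are ours, and only the counts are taken from the text.

```python
# Counts reported in the Results for April 5, 2020, through December 1, 2021.
PERSON_LEVEL_TOTAL = 33_815_355   # person-level cases meeting inclusion criteria
RACE_COMPLETE = 22_347_629
ETHNICITY_COMPLETE = 23_562_878
BOTH_COMPLETE = 18_881_379
AGGREGATE_TOTAL = 47_898_497      # aggregate case count for all 50 states


def pct(numerator, denominator):
    """Percentage rounded to 1 decimal place."""
    return round(100 * numerator / denominator, 1)


shares = {
    "race complete, of person-level cases": pct(RACE_COMPLETE, PERSON_LEVEL_TOTAL),
    "ethnicity complete, of person-level cases": pct(ETHNICITY_COMPLETE, PERSON_LEVEL_TOTAL),
    "both complete, of person-level cases": pct(BOTH_COMPLETE, PERSON_LEVEL_TOTAL),
    "person-level cases, of aggregate cases": pct(PERSON_LEVEL_TOTAL, AGGREGATE_TOTAL),
    "both complete, of aggregate cases": pct(BOTH_COMPLETE, AGGREGATE_TOTAL),
}

for label, value in shares.items():
    print(f"{label}: {value}%")
```

The last entry uses the aggregate count as the denominator, which is why the abstract reports 39.4% while the same numerator is 55.8% of person-level cases.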
Table 1. Person-level COVID-19 case surveillance data, aggregate national COVID-19 case counts, and number of person-level COVID-19 cases with complete data on race and ethnicity, overall and by state, United States, April 5, 2020–December 1, 2021
Abbreviations: CDC, Centers for Disease Control and Prevention; HHS, US Department of Health and Human Services.
“Complete” race and ethnicity refers to a person-level COVID-19 case in which 1 of 5 Office of Management and Budget 1997 fields for race 10 (American Indian/Alaska Native, Asian, Black, Native Hawaiian/Pacific Islander, White) was listed as yes or another race was specified, and ethnicity was collected.
Data source: Centers for Disease Control and Prevention. 14

Figure 1. Rolling 14-day average percentage volume of COVID-19 cases as reported by aggregate counts to the Centers for Disease Control and Prevention (CDC) and COVID-19 cases reported by person-level national COVID-19 case surveillance with complete race and ethnicity information, 50 states, April 5, 2020–December 1, 2021. Solid points indicate daily percentage of aggregate COVID-19 cases reported in the person-level CDC dataset with complete race and ethnicity information. Dashed line indicates the least-squares best-fit line between date of case reported to CDC and percentage of aggregate COVID-19 cases reported in the person-level CDC dataset with complete race and ethnicity information. Shaded regions indicate 95% CIs. “Complete” race and ethnicity refers to a person-level COVID-19 case in which 1 of 5 Office of Management and Budget 1997 fields for race 10 (American Indian/Alaska Native, Asian, Black, Native Hawaiian/Pacific Islander, White) was listed as yes or another race was specified and ethnicity was collected. Data source: Centers for Disease Control and Prevention. 14
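The quantities named in this caption (a rolling 14-day average, a Pearson correlation, and a least-squares line) can be sketched as follows. This is a plain Python illustration using only the standard library, not the authors' R code; the daily series is made up for demonstration, and the text does not specify whether the correlation was computed on the daily or the smoothed series.

```python
import math


def rolling_mean(values, window=14):
    """Trailing rolling mean; the first window - 1 positions are dropped."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def least_squares_line(x, y):
    """Slope and intercept of the ordinary least-squares fit of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical daily series: percentage of aggregate cases that reached the
# person-level dataset with complete race and ethnicity, by days since April 5, 2020.
days = list(range(60))
daily_pct = [45 - 0.1 * d + (2 if d % 2 else -2) for d in days]

smoothed = rolling_mean(daily_pct, window=14)
r = pearson_r(days, daily_pct)
slope, intercept = least_squares_line(days, daily_pct)
print(f"r = {r:.2f}, slope = {slope:.3f} percentage points per day")
```

A negative slope here corresponds to the direction of the reported association (R = −0.18), in which the share of aggregate cases arriving with complete race and ethnicity declined over time.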
The states with the highest and lowest percentages of COVID-19 cases with complete race and ethnicity information relative to the aggregate count of COVID-19 cases were Vermont (40 437 of 50 015; 80.8%) and North Dakota (22 of 162 087; <0.1%), respectively (Table 1). Eight states (Connecticut, Delaware, Georgia, Louisiana, Mississippi, North Dakota, Texas, and West Virginia) had <25% of aggregate COVID-19 cases sent to CDC with complete race and ethnicity information (Figure 2).

Figure 2. Percentage of COVID-19 cases as reported by aggregate counts to the Centers for Disease Control and Prevention and COVID-19 cases reported by person-level national COVID-19 case surveillance with complete information on race and ethnicity, 50 states, April 5, 2020–December 1, 2021. “Complete” race and ethnicity refers to a person-level COVID-19 case in which 1 of 5 Office of Management and Budget 1997 fields for race 10 (American Indian/Alaska Native, Asian, Black, Native Hawaiian/Pacific Islander, White) was listed as yes or another race was specified and ethnicity was collected. Data source: Centers for Disease Control and Prevention. 14
Of a total of 22 347 629 COVID-19 cases with at least 1 ascertained racial identity, 222 260 (1.0%) had 2 or more racial identities. When stratified by state, 5 states (Georgia, Hawaii, Nebraska, New Jersey, and West Virginia) had no COVID-19 cases that were identified as having 2 or more racial identities (Figure 3). The greatest discrepancy between the percentage of COVID-19 cases identified as having 2 or more racial identities in a state and the postcensal estimate of the percentage of that state’s population with 2 or more racial identities was in Hawaii (0% vs 24.6%).

Figure 3. Percentage of individuals listed as having 2 or more racial identities in the Centers for Disease Control and Prevention’s national COVID-19 case surveillance system versus state population data, 50 states, April 5, 2020–December 1, 2021. Hawaii’s 2020 postcensal estimate of the percentage of its population that identified as multiracial was 24.6%. 13 Open circles indicate the percentage of person-level COVID-19 cases with at least 1 racial identity that have 2 or more racial identities. Closed circles indicate the 2020 postcensal estimates of the percentage of a given state’s population that identify themselves as having 2 or more racial identities. Horizontal lines indicate the difference between the state’s percentage of person-level COVID-19 cases with 2 or more racial identities and the percentage of the state’s population that identify themselves as having 2 or more racial identities. Error bars indicate 95% CIs for the estimate of the proportion of states’ COVID-19 cases with 1 or more racial identities identified that have 2 or more racial identities using an exact binomial test with an associated Bonferroni adjustment for 50 tests. Data sources: US Census Bureau, 13 Centers for Disease Control and Prevention. 14
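The state-level comparison described here rests on an exact binomial test with a Bonferroni adjustment for 50 tests. The sketch below shows one common way to compute a two-sided exact binomial P value in plain Python (summing the probabilities of all outcomes no more likely than the observed one, the convention also used by R's binom.test). It is an illustration under stated assumptions, not the authors' code, and the state counts in the example are hypothetical.

```python
import math


def binom_log_pmf(k, n, p):
    """Log probability of k successes in n trials, for 0 < p < 1."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )


def exact_binomial_p(k, n, p):
    """Two-sided exact P value: total probability of every outcome that is
    no more likely than the observed count k."""
    log_obs = binom_log_pmf(k, n, p)
    tol = 1e-7  # tolerance so outcomes tied with the observed one are counted
    total = 0.0
    for i in range(n + 1):
        log_prob = binom_log_pmf(i, n, p)
        if log_prob <= log_obs + tol:
            total += math.exp(log_prob)
    return min(1.0, total)


def significant_after_bonferroni(p_value, n_tests=50, alpha=0.05):
    """Compare the P value with alpha divided by the number of tests."""
    return p_value < alpha / n_tests


# Hypothetical state: 2000 cases with at least 1 race recorded, 12 of them
# multiracial, in a state whose population is 5% multiracial.
cases_with_race = 2_000
multiracial_cases = 12
population_share = 0.05

p_value = exact_binomial_p(multiracial_cases, cases_with_race, population_share)
print(significant_after_bonferroni(p_value))
```

With 50 states, the Bonferroni-adjusted threshold is .05/50 = .001. The loop over all outcomes is adequate for illustrative sample sizes; state case counts in the hundreds of thousands would call for a statistical library.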
Among 33 815 355 COVID-19 cases with complete information on county of residence and CDC notification, 27 029 128 (79.9%) had either race or ethnicity information collected, and 18 881 379 (55.8%) had both race and ethnicity information collected. Of the 27 029 128 cases with either race or ethnicity information collected, 3 466 250 (12.8%) had only race information collected and 4 681 499 (17.3%) had only ethnicity information collected. When stratified by state, among cases for which race or ethnicity was ascertained, states differed in the percentage of cases with complete information on race, ethnicity, or both (Table 2).
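The counts in this paragraph partition the cases with any race or ethnicity information into 3 mutually exclusive groups. The Python sketch below checks that partition and recomputes the shares; the counts come from the text, whereas the "neither" count and the share with both fields among cases with either are derived here by arithmetic and are not figures reported by the authors.

```python
# Counts reported in the Results.
PERSON_LEVEL_TOTAL = 33_815_355
EITHER = 27_029_128          # race or ethnicity collected
BOTH = 18_881_379            # race and ethnicity collected
RACE_ONLY = 3_466_250
ETHNICITY_ONLY = 4_681_499


def pct(numerator, denominator):
    """Percentage rounded to 1 decimal place."""
    return round(100 * numerator / denominator, 1)


# The 3 groups should account for every case with either field collected.
partition_holds = BOTH + RACE_ONLY + ETHNICITY_ONLY == EITHER

# Derived values (not reported in the text).
neither = PERSON_LEVEL_TOTAL - EITHER
both_share_of_either = pct(BOTH, EITHER)

print(partition_holds)
print(pct(RACE_ONLY, EITHER), pct(ETHNICITY_ONLY, EITHER), both_share_of_either)
print(neither, pct(neither, PERSON_LEVEL_TOTAL))
```

The same identity links this paragraph to the earlier totals: cases with both fields plus race-only cases equal the 22 347 629 cases with complete race, and cases with both fields plus ethnicity-only cases equal the 23 562 878 cases with complete ethnicity.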
Table 2. Percentage of data by missingness type among COVID-19 cases with data on either race or ethnicity, 50 states, April 5, 2020–December 1, 2021
Abbreviation: CDC, Centers for Disease Control and Prevention.
Data source: Centers for Disease Control and Prevention. 14
Complete race refers to a person-level COVID-19 case in which 1 of the 5 Office of Management and Budget fields for race 10 (American Indian/Alaska Native, Asian, Black, Native Hawaiian/Pacific Islander, White) was listed as yes or another race was specified. Complete ethnicity refers to COVID-19 cases in which ethnicity was listed as Hispanic/Latino or non-Hispanic/Latino.
Data were suppressed when the total number of individuals was <10.
Discussion
Our findings provide considerations for using national COVID-19 case surveillance information. First, 33 states had <50% of all COVID-19 cases reported in aggregate sent to CDC’s person-level COVID-19 case surveillance database with complete race and ethnicity information. Second, many states either are not collecting information on multiple racial identities or are collecting this information only rarely. Third, many individuals had information on only race or ethnicity captured.
Our findings indicate that the total number of COVID-19 cases with complete data on race and ethnicity in COVID-19 case surveillance is <50% of the volume of COVID-19 cases reported in aggregate. To our knowledge, this analysis is the first to publish this information. Previous research focused separately on the percentage of COVID-19 cases sent to CDC and the percentage of COVID-19 cases in person-level surveillance with complete racial and ethnic information, while our study focused on the percentage of COVID-19 cases counted in aggregate that were sent to CDC with complete racial and ethnic information.4,16 We present the percentage of COVID-19 cases reported for each state with complete race and ethnicity information by comparing person-level surveillance data reported by CDC with CDC aggregate counts of cases. We found significant differences by state, with some states sending little or no information on race or ethnicity to CDC. This finding is important because COVID-19 datasets with race and ethnicity information aggregated by county of origin are now publicly available. 19 Understanding which states have low overall availability of data on race and ethnicity provides insight into the generalizability of studies examining race and ethnicity using these data. 20
According to the 2020 postcensal estimates, individuals with multiple racial identities compose 10.2% of the US population and at least 1.4% of every state’s population. 13 However, our study found that 49 states have significantly less than expected representation of individuals with multiple racial identities in COVID-19 data. Furthermore, 9 states had <10 individuals with multiple racial identities, and 5 states had no individuals with multiple racial identities. These findings indicate that several states are not able to store or transmit case surveillance data on multiple racial identities and suggest that most states are struggling to collect, store, or transmit case surveillance data on multiple racial identities. Our findings indicate that individuals with multiple racial identities are currently poorly represented in national COVID-19 case surveillance data. As has been observed in other studies, individuals with multiple racial identities have unique health risks and experiences compared with single-race individuals, such as lack of access to health care, increased risk of respiratory diseases, and increased risk of depression.21,22 As some states have struggled to collect even a single racial identity for COVID-19 cases, the capacity to assess and collect multiple racial identities poses clear logistical issues. However, without the collection of full and complete data on racial identities for COVID-19 cases, our understanding of the disease burden of COVID-19 on certain racial and ethnic communities (eg, AI/AN people, who are 39.4% multiracial) is incomplete. 13
Our study identified substantial differences in the collection of data on race and ethnicity by states, which undermine efforts to understand the burden of disease for all populations. 23 In cases in which data on race or ethnicity were obtained, states differed in the proportions of cases for which they recorded race or ethnicity. Prior research found that the collection of data on race and ethnic identity can result in the omission of race among those who identify as Hispanic/Latino, many of whom consider their Hispanic identity to be a part of their racial identity. 23 Furthermore, compared with non-Hispanic individuals, Hispanic individuals may have higher levels of discrepancies in how they see their racial identity and how their race is recorded by health care providers. 23 This is especially relevant to the AI/AN population, among whom 37.5% identify as Hispanic/Latino. As a result, AI/AN individuals may be more likely than non-Hispanic White individuals to have their race and ethnicity inaccurately recorded. The inconsistent collection of data on both race and ethnicity highlights the need for self-reporting of race and standardization in organizations on the methods used to capture data on race and ethnicity.
Limitations
Our analysis had several limitations. First, our analysis of person-level COVID-19 surveillance data includes only laboratory-confirmed COVID-19 cases, as all states report on this category at a minimum. As such, our estimates may overestimate missingness for states that report aggregate numbers for both confirmed and probable COVID-19 cases. Second, data obtained by CDC are not one-to-one copies of data collected by individual states. Interoperability issues limit the abilities of states to send their information to CDC. Third, the cases in the aggregate COVID-19 dataset do not represent all cases in an area because some COVID-19 cases are not reported to the state or transmitted in aggregate. Fourth, lags between when cases are reported to CDC in aggregate and when cases are transmitted to the person-level dataset might cause poor correlation between the 2 datasets. Fifth, the numbers in this study from state aggregate COVID-19 counts from CDC and from person-level COVID-19 case surveillance are aggregated at the state level. Therefore, there may be additional nuances of reporting practices at the county level that were not analyzed in our study. Finally, COVID-19 infection resulting in severe complications or death may make it impossible to obtain self-reported data on race and ethnicity, which may have resulted in less reliable assessments of these factors.
Conclusions
While state-specific insights into why variation in racial and ethnic data quality among states is occurring are generally unavailable, a report published by the Council of State and Territorial Epidemiologists identified several factors limiting the transmission of race and ethnicity data to CDC: information system limitations at the point of collection and public health agency, patient hesitation to indicate race or ethnicity, limited resources or staffing, and state laws or agency policies that limit or prohibit further sharing of race and ethnicity data for individuals with COVID-19. 24 Data modernization initiatives such as data storage and interoperability of systems offer promising results for streamlining the processes that are bottlenecking state capacity to fully and accurately document the race and ethnicity of individuals with COVID-19.25,26 Data modernization initiatives may especially benefit health departments in case surges when automated, standardized, and streamlined systems would be able to handle sudden increases in workload that might overwhelm the less flexible capacity of public health workers. 27 However, data modernization initiatives should be coupled with appropriate incentives for state and local health departments to reward adherence to already existing laws mandating the collection of data on race and ethnicity in COVID-19 case surveillance. 19
Our findings highlight a high degree of missingness of data on race and ethnicity in the US case surveillance system, which can enhance our understanding of current challenges of using these data to understand COVID-19’s impact on Black, Indigenous, and People of Color. First, our study identified substantial variation in the rate at which states are providing case surveillance with complete racial information to CDC. Second, our study highlights states that have no representation of individuals with multiple racial identities because of the methods by which these states collect data on racial identity. Finally, our study highlights ongoing issues in the simultaneous collection of data on race and ethnicity. These elements weaken effective surveillance of COVID-19 outcomes among Black, Indigenous, and People of Color, especially among multiracial and Hispanic communities. Without access to high-quality data on race and ethnicity, local, state, and federal health departments are unable to adequately allocate scarce public resources and address health disparities. These shortcomings in completeness of data on race and ethnicity require investments into public health infrastructure, including data modernization initiatives, adequate funding of public health staff, and education on the importance of collecting race and ethnicity data to build public health systems that provide a clear picture of disease burden in all communities.
Supplemental Material
sj-docx-1-phr-10.1177_00333549231154577 – Supplemental material for Completeness of Race and Ethnicity Reporting in Person-Level COVID-19 Surveillance Data, 50 States, April 2020–December 2021
Supplemental material, sj-docx-1-phr-10.1177_00333549231154577 for Completeness of Race and Ethnicity Reporting in Person-Level COVID-19 Surveillance Data, 50 States, April 2020–December 2021 by Scott Erickson, Rachael Bokota, Christine Doroshenko, Kate Lewandowski, Kojo Osei, Kaeli Flannery and Adrian Dominguez in Public Health Reports
Footnotes
Acknowledgements
The authors acknowledge Shannen Keene, MS, for assisting in editing and proofreading this article.
Authors’ Note
Coauthor Adrian Dominguez, MPH, passed away January 14, 2023. We dedicate this work to his memory as an epidemiologist, mentor, and champion for public health.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Research was supported by the US Department of Health and Human Services, Indian Health Service, Epidemiology Program for American Indian/Alaska Native Tribes and Urban Indian Communities, grant number U1B1IHS0006-21-00.
Supplemental Material
Supplementary figures for this article are available online. The authors have provided these supplemental materials to give readers additional information about their work. These materials have not been edited or formatted by Public Health Reports’s scientific editors and, thus, may not conform to the guidelines of the AMA Manual of Style, 11th Edition.
