Abstract
Scholars rely on accurate population and mortality data to inform efforts regarding the coronavirus disease 2019 (COVID-19) pandemic, with age-specific mortality rates of high importance because of the concentration of COVID-19 deaths at older ages. Population counts, the principal denominators for calculating age-specific mortality rates, will be subject to noise infusion in the United States with the 2020 census through a disclosure avoidance system based on differential privacy. Using empirical COVID-19 mortality curves, the authors show that differential privacy will introduce substantial distortion in COVID-19 mortality rates, sometimes causing mortality rates to exceed 100 percent, hindering our ability to understand the pandemic. This distortion is particularly large for population groupings with fewer than 1,000 persons: 40 percent of all county-level age-sex groupings and 60 percent of race groupings. The U.S. Census Bureau should consider a larger privacy budget, and data users should consider pooling data to minimize differential privacy’s distortion.
As coronavirus disease 2019 (COVID-19) grips the world, scholars, policy makers, and journalists use population data to calculate various population-level COVID-19 rates (incidence or new case rate, prevalence or total case rate, and mortality) to better understand, communicate, address, and inform mitigation efforts of the COVID-19 pandemic (Dowd et al. 2020; Wadhera et al. 2020). Because of these rate calculations, we know that the elderly are more susceptible to COVID-19-related mortality (CDC 2020) and that racial minorities are presently affected at higher rates (Price-Haywood et al. 2020). Accurate COVID-19 rate calculations and estimates are thus paramount to managing this and future pandemics. Inaccurately assessing COVID-19 could lead to misallocation of resources and interventions to mitigate the crisis.
The calculation of any population-level COVID-19 rate is relatively straightforward: one divides the COVID-19 counts (incidence, prevalence, and deaths) by the appropriate population counts from census data. To date, scholars have focused largely on the numerators, that is, on properly counting COVID-19 cases and deaths (Banerjee et al. 2020; Remuzzi and Remuzzi 2020). However, scholars and policy makers in the United States must also be mindful of the population counts in the denominator of COVID-19 rate calculations because of the implementation of differential privacy (DP) in the publication of the 2020 census counts.
A disclosure avoidance system (DAS) will be implemented with the 2020 census tabulations (Mervis 2019), whereby population counts will be subject to noise infusion in an effort to protect respondent privacy. The U.S. Census Bureau is charged with protecting the confidentiality of its respondents. Beginning with the 1970 census, the Census Bureau has used a wide array of disclosure avoidance techniques to protect respondent confidentiality. These techniques include suppression of tables with small cell sizes, swapping or interchanging responses, and suppressing and then imputing responses (Zayatz 2007). Starting with the 2020 census, the Census Bureau plans to “modernize” its disclosure avoidance practices using DP (Ruggles et al. 2019). This is the first large-scale, census-based implementation of DP in the history of this methodology and represents a monumental sea change in population statistics (Garfinkel, Abowd, and Powazek 2018).
Under the Census Bureau’s proposed DAS using DP, population counts will be subject to noise infusion whereby random numerical values, drawn from a statistical distribution under a specific privacy budget, are added to or subtracted from the “true” population data; the smaller the budget, the greater the noise. The Census Bureau then postprocesses the data to eliminate fractional and negative populations created during the DP process. The differences between the underlying, “true” population counts in the Census Bureau Summary File (SF) and the noise-infused DAS counts could lead to substantial over- or underestimation of COVID-19 rates, depending on the divergence between the two. The Census Bureau has yet to finalize its DAS algorithm and continues to refine it, so it is unclear how similar the demonstration products are to the final product. Importantly, the Census Bureau could accept less privacy in exchange for less noise and more utility.
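The noise-infusion and postprocessing steps described above can be illustrated with a minimal sketch. It assumes a simple Laplace mechanism with scale 1/ε and a naive clip-and-round postprocessing step; the Census Bureau's actual DAS is considerably more elaborate, and the function names, the ε values, and the example counts below are hypothetical.

```python
import random

def laplace_noise(epsilon, rng):
    """Draw Laplace(0, 1/epsilon) noise as the difference of two exponentials."""
    # Assumption: a count query has sensitivity 1, so the noise scale is 1/epsilon.
    return rng.expovariate(epsilon) - rng.expovariate(epsilon)

def noisy_count(true_count, epsilon, rng):
    """Add noise to a count, then postprocess to a nonnegative integer."""
    raw = true_count + laplace_noise(epsilon, rng)
    # Naive postprocessing (an assumption): round, then clip negatives to zero.
    return max(0, round(raw))

def mean_abs_pct_error(true_count, epsilon, trials=5000, seed=0):
    """Average absolute percentage error of the noisy count over many draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += abs(noisy_count(true_count, epsilon, rng) - true_count)
    return 100.0 * total / (trials * true_count)

if __name__ == "__main__":
    # Hypothetical privacy budgets and cell sizes, chosen only for illustration.
    for eps in (0.1, 1.0):
        for pop in (50, 1000, 50000):
            print(f"epsilon={eps} pop={pop} "
                  f"error={mean_abs_pct_error(pop, eps):.3f}%")
```

In this sketch the noise has the same absolute scale for every cell, so the percentage error grows as either the privacy budget or the population count shrinks, which is the qualitative pattern the article examines.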
Scholars are only beginning to study DP, its accuracy, and its consequences. The extent to which DP would distort the calculation of COVID-19-related rates is currently untested. For the calculation of COVID-19 incidence and prevalence rates, there will be no alternative to differentially private 2020 census data. Given how crucial population counts are for the evaluation and tracking of epidemiological rates, noise-infused population counts could lead to erroneous COVID-19 rate calculations and harm our ability to understand the current pandemic and manage future public health crises. Accurate population counts are just as important as accurate COVID-19-related counts. Rates calculated with counts from the proposed DAS could hinder our understanding of disparities arising from events such as the COVID-19 pandemic, which we use here as an illustrative case. Furthermore, the changes in these rates could misstate the impact of COVID-19 across space and population subgroups, particularly within small areas and, more noticeably, for racial/ethnic minorities.
In this short article, we demonstrate the extent to which DP could distort COVID-19 rates by age-sex and by race by combining the most recent Census Bureau DAS demonstration products segmented by age and sex (Van Riper, Kugler, and Schroeder 2020) with empirical COVID-19 age and sex mortality curves from the Centers for Disease Control and Prevention (2020) and a hypothetical 70 percent infection rate, the theoretical herd immunity threshold for the United States (Kwok et al. 2020). This allows us to simulate the difference between hypothetical mortality rates calculated with DP population counts and those calculated with population counts produced using current methods. Although we use mortality rates, COVID-19 incidence and prevalence rates are calculated in the same way and would be subject to the same bias.
Methods
We use two primary sources of data in our estimates concerning the denominators for COVID-19 rate calculations and one primary source of data concerning the numerators. For the denominators, we use the 2010 county-level population estimates from traditional disclosure avoidance techniques and 2010 county-level population estimates produced with the proposed DP 2010 demonstration product (Van Riper et al. 2020) from May 27, 2020: the most recent file with age × sex detail. We accessed county-level population counts in 10-year age groups by sex and county-level population counts by race/ethnicity. The 2010 demonstration product simulates the DP algorithm on the 2010 census SF 1 to provide a comparison between traditional disclosure avoidance counts and the new DP counts. The DP demonstration product provides the denominators for calculating the COVID-19 mortality rates but not the numerators.
To calculate the number of anticipated COVID-19 deaths by age and sex, we apply empirical age and sex mortality rates from the Centers for Disease Control and Prevention (2020) to the 2010 Census Bureau SF 1 data that are not produced using DP and assume a 70 percent infection rate before herd immunity halts the spread (Kwok et al. 2020). This allows us to estimate the anticipated mortality for the underlying, “true” population (
For our race analysis, we apply empirical mortality rates from the Centers for Disease Control and Prevention to each race group r in each county i and a 70 percent infection rate to estimate the COVID-19 mortality rates under SF and DP (
We then calculate a mortality rate ratio (MRR), expressed as the ratio of the DP mortality rate to the SF mortality rate minus 1 ([MDP/MSF] – 1), where values above 0 represent a DP mortality rate that exceeds the SF mortality rate.
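The calculation described in this section can be sketched as follows. This is a minimal illustration rather than the replication code: the function names, the 10 percent death rate among the infected, and the cell counts of 800 (SF) and 640 (DP) are hypothetical, and the sketch assumes, as the text describes, that expected deaths are derived from the SF count and then divided by each denominator.

```python
import math

INFECTION_RATE = 0.70  # infection rate assumed in the text before herd immunity

def expected_deaths(pop_sf, death_rate_if_infected):
    """Expected deaths in a cell, derived from the SF ("true") population."""
    return pop_sf * INFECTION_RATE * death_rate_if_infected

def mortality_rate(deaths, population):
    """Deaths divided by a population denominator."""
    if population <= 0:
        # Assumption: treat a zero denominator as an unbounded rate.
        return math.inf
    return deaths / population

def relative_difference(m_dp, m_sf):
    """(M_DP / M_SF) - 1; positive when the DP rate exceeds the SF rate."""
    return m_dp / m_sf - 1.0

if __name__ == "__main__":
    pop_sf, pop_dp = 800, 640           # hypothetical cell counts
    deaths = expected_deaths(pop_sf, 0.10)  # hypothetical age-specific rate
    m_sf = mortality_rate(deaths, pop_sf)
    m_dp = mortality_rate(deaths, pop_dp)
    print(m_sf, m_dp, relative_difference(m_dp, m_sf))
```

Because the numerator is the same in both rates, the relative difference in this sketch reduces to the ratio of the SF count to the DP count minus 1 and does not depend on the assumed death rate.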
Reproducible Research
All data and code necessary to reproduce the reported results are licensed under the CC-BY-4.0 license and are publicly available in a replication repository located at https://osf.io/2v7ea/?view_only=443404fc9af041dc876d0617385f9255.
Results
Figure 1a shows the distortion of COVID-19 age-sex-specific mortality rates by population size for U.S. counties using the 2010 demonstration products. We find that smaller age-sex populations have much higher absolute errors than larger populations. These errors are not limited to small areas or a single age group; rather, they are present in all age groups. Additionally, using DP counts as the denominator causes some age-specific mortality rates to impossibly exceed 100 percent (red dots). For example, in the 2010 census, Kent County, Texas, contained 58 women aged 85 and older, but the DP count is 2. If the number of COVID-19 cases or deaths in this age-sex group exceeds 2, the calculated COVID-19 rate would impossibly exceed 100 percent. It is particularly worrisome that age-sex groups with fewer than 1,000 persons, which make up more than 40 percent of all county-level age-sex groupings in the United States, exhibit especially large errors (Table 1), making COVID-19 rates difficult to interpret for large segments of the country.
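To make the Kent County example concrete, the following worked arithmetic is a sketch under the assumption, consistent with the Methods, that the same death count is divided by each denominator. The symbol D for that death count is introduced here for illustration and is left symbolic because the text does not report it.

```latex
\[
M_{SF} = \frac{D}{58}, \qquad
M_{DP} = \frac{D}{2}, \qquad
\frac{M_{DP}}{M_{SF}} - 1 = \frac{58}{2} - 1 = 28 .
\]
```

Under these assumptions the DP-based rate is 29 times the SF-based rate, an error of 2,800 percent regardless of D, and the DP-based rate exceeds 1 whenever D is greater than 2.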

Figure 1. The distortion of coronavirus disease 2019 age-sex-specific mortality rates for U.S. counties. We show only those county age-sex groups with less than 500 percent error. Red dots correspond to county age-sex groups with mortality rates that impossibly exceed 1.0. The blue line is a locally estimated scatterplot smoothing (LOESS; span = 1). (a) Age-sex-specific mortality rates. (b) Race-specific mortality rates. Errors drop precipitously with at least 1,000 persons.
Table 1. Absolute Percentage Errors by Population Size for Age-Sex Groups and for Race/Ethnic Groups.
Note: Pop refers to populations less than or equal to a given value.
The DAS distorts general mortality rates for racial/ethnic minorities (Santos-Lozada, Howard, and Verdery 2020), and Figure 1b shows the distortion of COVID-19 race-specific mortality rates by population size for U.S. counties. As with age-sex-specific mortality, error increases substantially as population size decreases for all race groups. The white non-Hispanic group exhibits the lowest error; all other race groups, including a pooled group of all nonwhite groups, exhibit large errors as population size decreases. Race groups with fewer than 1,000 persons, which make up more than 60 percent of all county-race groups, exhibit the largest errors.
Balancing Data Privacy and Utility
We highlight how the planned 2020 census data under DP will significantly alter our understanding of COVID-19 via noise-infused population counts. Using age-sex-specific COVID-19 mortality curves from the CDC, we show that DP will introduce substantial errors in COVID-19 expected age-sex-specific mortality rates—sometimes causing age-specific mortality rates to exceed 100 percent—hindering our ability to understand the pandemic. These errors are particularly large for approximately 40 percent of county age-sex groupings and 60 percent of county-race groupings containing fewer than 1,000 persons. Overall, DP will introduce significant challenges in our understanding of the COVID-19 global pandemic expected to last well into 2021.
Age groups with fewer than 1,000 persons can occur in counties with relatively large total populations. Autauga County, Alabama, has four age-sex groups with fewer than 1,000 persons (men 85 and older, women 85 and older, total 85 and older, and men 75 to 84) yet had a total population in 2010 of 109,000. Autauga County is one of approximately 500 U.S. counties with more than 100,000 people (putting it in the top 15 percent of counties by population size), yet it still contains age groups with fewer than 1,000 people. Thus, large distortions are not limited to small, rural counties but can be found in relatively large, urban counties as well.
How are we to understand this pandemic if the very foundation upon which we calculate the most basic rates contains significant distortion? How will cities, states, and the federal government effectively manage the current or future pandemics if crucial denominators are untrustworthy? The populations most at risk for DP distortion, namely, elderly and minority populations, are the very groups COVID-19 harms the most and are in need of the most targeted interventions. If we cannot parse out the noise from the true values, we are left with a muddied vision of the pandemic, and our responses will further reflect that uncertainty. To provide some guidance, we offer recommendations for the Census Bureau and those calculating COVID-19 rates.
The Census Bureau is still fine-tuning its DP algorithm and has previously expressed concern about the trade-off between privacy and utility (Abowd and Schmutte 2019). A second run of the DP algorithm addressed numerous concerns of the data user community (U.S. Census Bureau 2020), yet its utility still needs assessment. However, the Census Bureau’s release of the second run of the DP algorithm contains only race-sex breakouts, making it impossible to conduct such an assessment. Census data are foundational to many kinds of analyses, including some analyses the Census Bureau probably never envisioned, and unfortunately the COVID-19 pandemic arose in the midst of the Census Bureau’s privacy changes. Because the Census Bureau DP demonstration products are so new, deep analysis of the impact these changes will have on the utility of public health data has yet to be undertaken. As we show, the DP algorithm, as proposed, sacrifices the usefulness of basic COVID-19 calculations in many counties and population groups.
There is still time for the Census Bureau to continue refining its DP algorithm or to increase the privacy budget to allow more stable estimates in more population groups. The first 2020 census data products were originally slated for release in December 2020, but with the updated 2020 census timeline, the first products should be released by April 2021. The Centers for Disease Control and Prevention releases health and mortality data with a lag, so detailed COVID-19-related analyses will very likely rely on 2020 census noise-infused population counts rather than population counts produced using traditional methods. If the DP algorithm continues to produce distorted COVID-19 rates, data users might instead turn to outdated population estimates released prior to DP for their COVID-19 calculations.
The Census Bureau should consider alternative data sets specifically tailored for COVID-19 analyses, alternative DASs, or a larger privacy budget during this historic pandemic. It is entirely possible that future scientists, during the next major pandemic, will turn to the remnants of the COVID-19 data to understand their own pandemic—data that DP will certainly distort. The decisions the Census Bureau makes now will have long-term repercussions for what we can learn about COVID-19. Scholars, policy makers, and journalists turn toward the last major global pandemic, the 1918 influenza, to draw important parallels from the historical clues left behind in photographs, newspapers, and scientific articles. Those parallels play a powerful role in shaping public discourse, even with their historical patina. When we look back on COVID-19 during the next major global pandemic, any statistical measures arising from the United States will be far less meaningful because of the infusion of noise in the very building blocks of COVID-19 rates.
When, and not if, the Census Bureau releases DP data, all data users analyzing COVID-19 need to be aware of the limitations of using DP data for COVID-19 analyses. On the basis of our findings, we offer three recommendations to scholars and policy makers. First, we suggest a minimum cell size of 1,000 persons for the calculation of any COVID-19 rates (incidence, prevalence, and mortality). The distortion in COVID-19 rates shrinks rapidly as population sizes increase, especially above 1,000 persons. Second, scholars and policy makers can combine areas into regions to create larger cell sizes, sacrificing geographic detail for population specificity. The Census Bureau uses this approach for its public use microdata samples, and we recommend a similar approach for COVID-19 analyses. Third, scholars can pool data into either wider age intervals (e.g., 20-year age intervals rather than 10-year age intervals) or wider race classifications (e.g., using the Office of Management and Budget’s minimum race classifications rather than the fully detailed nine-race classification). These strategies, either in isolation or in combination, will minimize the distortion in COVID-19 rate calculations.
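The first and third recommendations can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed procedure: the function names and the county's 10-year age-group counts are hypothetical, and only the 1,000-person threshold and the move from 10-year to 20-year intervals come from the text.

```python
MIN_CELL_SIZE = 1000  # minimum cell size suggested in the text

def pool_adjacent(counts, width=2):
    """Sum adjacent groups; width=2 turns 10-year groups into 20-year groups."""
    return [sum(counts[i:i + width]) for i in range(0, len(counts), width)]

def small_cells(counts, threshold=MIN_CELL_SIZE):
    """Return the indexes of cells that fall below the minimum cell size."""
    return [i for i, count in enumerate(counts) if count < threshold]

if __name__ == "__main__":
    # Hypothetical counts for eight 10-year age groups in one county.
    ten_year = [4200, 3900, 3100, 2500, 1800, 900, 450, 150]
    twenty_year = pool_adjacent(ten_year)
    print("10-year cells below threshold:", small_cells(ten_year))
    print("20-year counts:", twenty_year)
    print("20-year cells below threshold:", small_cells(twenty_year))
```

In this hypothetical county, pooling reduces the number of cells below the threshold but does not eliminate them, which is why the strategies may need to be combined, for example by also merging neighboring areas.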
The Census Bureau’s demonstration product presently contains only age-sex-county and race-county breakdowns and does not contain an age-sex-race-county breakdown. Yet race differentials in COVID-19 mortality are an important aspect of the pandemic (Hooper, Nápoles, and Pérez-Stable 2020). The potential errors in COVID-19 mortality by age and sex are already large, and we believe that analyzing COVID-19 mortality by age-sex-race would further reduce cell sizes, producing an even greater number of combinations with fewer than 1,000 persons, the threshold below which errors are largest.
As the pandemic continues, scholars, policy makers, and journalists should embrace minimum standards for COVID-19 analyses using the 2020 census and subsequent data products. Recent visualizations by the New York Times and the Centers for Disease Control and Prevention (CDC 2020; Oppel et al. 2020) concerning racial/ethnic disparities in COVID-19 demonstrate the intense hunger for detailed COVID-19 analysis. Future analyses should, at a minimum, take into account the issues involved in using noise-infused population counts and should incorporate the strategies outlined above to ensure that they accurately reflect their chosen measurement and the social phenomenon of interest.
Acknowledgements
We gratefully acknowledge early comments and feedback from B. Jarosz, J. Howard, M. Taylor, and D. Van Riper.
