1. Background
1.1. Official Statistics and Descriptive Inference
Official statistics are produced by national statistical offices (NSOs) to provide high-quality descriptions of demographic, economic, environmental, and other population characteristics of a country. Such descriptions may be based on complex surveys of households or establishments or may draw on information from censuses and registries. While statistical modeling may help to support these descriptions (coverage or nonresponse adjustments, model-assisted calibration methods) or to allow otherwise infeasible descriptions (small area estimation, “bridging” to give continuity in an estimated series after some methodological change), the fundamental goal is to draw conclusions, with quantifiable uncertainty, about observable characteristics of a real population. Typical official statistics are thus classic examples of “descriptive inference.” An expectation of official statistics is that they are designed to be widely useful, and they must maintain quality, objectivity, utility, integrity, and relevance over time. For example, the Office of Management and Budget in the U.S. provides twenty standards and related guidelines for Federal censuses and surveys (U.S. Office of Management and Budget 2006).
1.2. Analytic Inference
In “analytic inference” the inferential goal is to draw conclusions, with quantifiable uncertainty, about typically unobservable characteristics of a population, real or hypothetical. Examples include prediction of population characteristics in some unobserved scenarios, such as a future time point or an alternative policy regime. Parametric superpopulation models, assumed to generate the observed finite population as one possible realization under one possible parameterization, fall within the analytic inference paradigm.
Causal inference is naturally analytic. Under the framework of Rubin (1974), each unit (e.g., a person or establishment) must have positive probability of being exposed to each treatment (or intervention or policy) of interest. Each unit has a potential outcome associated with each treatment at a given time point, following treatment exposure. The causal effect is the difference between the potential outcome under one treatment and that under another. Because each unit can be exposed to only one treatment, one of its potential outcomes—and thus the causal effect—is unobservable. Causal inference thus requires invoking identification assumptions.
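As a concrete illustration of the potential-outcomes framework (using invented data, not drawn from any official source), the sketch below simulates both potential outcomes for each unit. Only one outcome per unit is ever "observed," yet under randomized assignment the difference in observed group means still recovers the average causal effect:

```python
import random

# Hypothetical illustration of the potential-outcomes framework.
# All numbers are invented for this sketch.
random.seed(0)

n = 10_000
units = []
for _ in range(n):
    y0 = random.gauss(50, 10)   # potential outcome under control
    y1 = y0 + 5                 # potential outcome under treatment (true effect = 5)
    units.append((y0, y1))

# The unit-level effect y1 - y0 is unobservable in practice: each unit
# is exposed to only one treatment. Randomization identifies the average.
treated, control = [], []
for y0, y1 in units:
    if random.random() < 0.5:   # randomized treatment assignment
        treated.append(y1)      # only y1 is observed
    else:
        control.append(y0)      # only y0 is observed

ate_hat = sum(treated) / len(treated) - sum(control) / len(control)
print(round(ate_hat, 1))        # close to the true effect of 5
```

In an observational setting, where assignment is not randomized, this simple difference would generally be biased, which is why the identification assumptions mentioned above are needed.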
Valid statistical methods for analytic inference have developed along complementary paths. First, one can carefully build features of the complex survey design into standard likelihood-based (frequentist or Bayesian) statistical approaches. Second, one can use asymptotic properties of survey-weighted estimators which, under relatively mild conditions, yield asymptotically normal estimators of finite population quantities, including solutions of estimating equations. Because the asymptotic normality is preserved as finite population quantities converge to superpopulation parameters, a chaining argument then gives reasonable analytic inference using only the design-based point estimation and variance estimation tools of descriptive inference. This chaining approach is built into standard software for survey statistics and allows a single set of tools to be used for both descriptive and analytic inference.
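The design-based building blocks that the chaining argument reuses can be sketched in a few lines. The example below (with invented responses and weights) computes a survey-weighted mean and a delete-one jackknife variance, the kind of replication-based variance estimation that standard survey software generalizes with full design detail:

```python
# A minimal sketch of design-based point and variance estimation:
# a survey-weighted mean with a delete-one (JK1) jackknife.
# Responses and weights are invented for illustration.

y = [12.0, 15.0, 9.0, 20.0, 14.0, 11.0]       # observed responses
w = [100.0, 80.0, 120.0, 90.0, 110.0, 100.0]  # survey weights

def weighted_mean(ys, ws):
    return sum(yi * wi for yi, wi in zip(ys, ws)) / sum(ws)

theta_hat = weighted_mean(y, w)

# Delete-one jackknife: drop each unit in turn, re-estimate, combine.
n = len(y)
replicates = [
    weighted_mean(y[:i] + y[i + 1:], w[:i] + w[i + 1:]) for i in range(n)
]
var_hat = (n - 1) / n * sum((r - theta_hat) ** 2 for r in replicates)

print(theta_hat, var_hat)
```

The same point-and-variance machinery, applied to solutions of estimating equations rather than means, is what lets a single design-based toolkit serve both descriptive and analytic inference.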
1.3. Analytic Uses of Official Statistics
Official statistical data, such as a weighted survey data set, may be used directly in the fitting of analytic models or may be relied on less directly as a “representation” anchor, such as in calibration weighting for a new survey or combining of probability and nonprobability samples. For analyses aiming to address causal hypotheses, official statistical data products play a crucial role when determining whether findings from intervention studies such as randomized controlled trials (RCTs) generalize from the study sample to the population for which an intervention (e.g., a program or policy) is intended (Stuart et al. 2011). For example, the Institute of Education Sciences in the U.S. notes that the American Community Survey is a potential representation anchor for some of its studies’ goals (Tipton and Olsen 2022). Official statistical data have a similar role in studies of whether a causal estimate can be transported to external populations (Degtiar and Rose 2023).
1.4. Analytic Users of Official Statistics
We distinguish among “designers” of official statistics, who are involved in the development of surveys and other collections and can directly influence development toward analytic goals; “primary analysts” who have full access to data and information on design; and “secondary analysts” who rarely have full access to data or full information on design. Our focus is on secondary analysts, outside the NSOs: what challenges and research opportunities are there in support of secondary analytic inference from official statistics?
2. Challenges for Analytic Inference with Official Statistics
The following discussion primarily reflects the authors’ experience with the decentralized NSOs of the United States but is expected to be relevant for other statistical systems.
Secondary analysts will continue to rely on the availability and quality of official statistics, while NSOs are likely to face further challenges in their ability to deliver those statistics. It is essential to maintain mutual support between the research community and the NSOs.
2.1. The Need for Blended Data
The set of analytic and causal questions posed by secondary data users is large and dynamic, and many research questions will be unanswerable with a single official statistical source. Thus, combining a standalone official statistical data product (e.g., a survey) with additional data sources (e.g., another survey or administrative data) is often necessary. Increasingly, analysts will need blended data products that use weighting, linkage, matching, or other statistical methods to integrate information across data sources.
For example, causal inference in observational studies often requires rich sets of covariates, such as for propensity scoring (Rosenbaum and Rubin 1983); instrumental variables to account for non-ignorable treatment receipt (Angrist et al. 1996); or longitudinal data. A key feature of a treatment is that time generally must elapse before its effect can be observed. Longitudinal datasets produced for official statistics are thus well suited to causal research questions because they allow time to pass between potential treatments and outcomes. Many countries’ NSOs produce longitudinal datasets to study topics related to labor, income, education, and health. Examples include the U.S. Bureau of Labor Statistics’ National Longitudinal Surveys, Statistics Canada’s Survey of Labour and Income Dynamics, and Statistics Sweden’s Swedish Longitudinal Integrated Database for Health Insurance and Labour Market Studies.
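The role of propensity adjustment mentioned above can be sketched with inverse-propensity weighting. In this illustration the propensity scores are treated as known (in practice they are estimated, e.g., by logistic regression on the observed covariates), and all data are simulated rather than taken from any survey:

```python
import random

# Hedged sketch of inverse-propensity weighting (in the spirit of
# Rosenbaum and Rubin 1983). Propensities are known here for simplicity;
# all quantities are simulated for illustration only.
random.seed(1)

rows = []
for _ in range(20_000):
    x = random.random()                          # a confounder
    p = 0.2 + 0.6 * x                            # treatment propensity depends on x
    t = 1 if random.random() < p else 0
    y = 2.0 * x + 3.0 * t + random.gauss(0, 1)   # true treatment effect = 3
    rows.append((x, p, t, y))

# A naive difference in observed means is confounded by x ...
naive = (sum(y for _, _, t, y in rows if t) / sum(t for _, _, t, _ in rows)
         - sum(y for _, _, t, y in rows if not t) / sum(1 - t for _, _, t, _ in rows))

# ... while weighting treated units by 1/p and controls by 1/(1-p)
# recovers the average treatment effect.
ipw = (sum(t * y / p for _, p, t, y in rows) / len(rows)
       - sum((1 - t) * y / (1 - p) for _, p, t, y in rows) / len(rows))

print(round(naive, 2), round(ipw, 2))  # naive is biased upward; ipw is near 3
```

The value of rich covariate sets in official data products is precisely that they make propensity models like the one assumed here plausible.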
But many research questions do not have a relevant longitudinal dataset, and researchers may need to link multiple cross-sectional data sources together to represent different time periods. In such studies, researchers need to be especially aware of complications such as recall bias and survivor bias when considering research questions in which the treatment occurred in the past. The U.S. Centers for Disease Control and Prevention’s Behavioral Risk Factor Surveillance System is an example of a cross-sectional dataset that collects information on a wide variety of health behavior and outcome topics and has been widely used in causal or analytic studies of the effect of health behaviors.
Blending of data products, of whatever form, complicates analysis, and general-purpose tools for analytic inference are often not readily available. There is progress in some areas that suggests paths forward, such as the literature on combining probability and nonprobability samples and on using quasi-randomization methods to quantify uncertainty.
2.2. The Need for Privacy Protections
The blending of different data products may heighten disclosure risks, so NSOs will apply some statistical disclosure control (SDC) methods before the blended data are released as official statistics. These methods might include variable coarsening, variable suppression, data swapping, or differential privacy methods.
Privacy protections for data products complicate statistical inference of all types, and general-purpose tools for inference to take the privacy protections into account are not readily available. This will remain an active area of research.
One useful approach for secondary analysts is the release of a synthetic version of the dataset with statistical properties similar to those of the underlying data. Accurately reflecting the true population across many relationships, and yielding accurate results for a wide variety of analyses, is an enormous challenge. One promising area of research for synthetic data generation is the use of generative adversarial networks (Figueira and Vaz 2022) and other machine learning methods. Raghunathan (2021) and Drechsler and Haensch (2024) give summaries of recent methods.
2.3. The Need for Detailed Metadata
Metadata describe data, including statistical objects such as variables, value domains, data sets, questions, sampling plans, estimators, and more (SCOPE Metadata Team 2020). Even without the further complexities of blending and disclosure protections, researchers may find the available metadata lacking for analytic purposes. For example, general purpose survey design and nonresponse weights might not be tailored to analyses of a focal variable for causal inference, requiring users to customize their weights, as in Polsky et al. (2005).
The key challenges for secondary analysts are that they have neither the influence on NSO studies of a designer nor the access (to data and metadata) of a primary analyst, and the key challenges of NSOs are that they cannot allow such influence and provide such access while creating general-purpose official statistics of sufficient quality in a timely and cost-effective way.
3. Opportunities
Previous generations of secondary analysts and NSOs have faced similar challenges, with a natural tension between the specific research needs of analysts and the broad descriptive aims of NSOs. Official statistics were not designed to address analytic questions generally or to focus on specific analytic questions. But the research community developed suitable methodologies (e.g., survey-weighted estimation, analytic variance approximations, and variance estimators, including replication methods) and corresponding software to yield reliable inferences with relatively minimal design information. NSOs now routinely provide sufficient metadata to support many analytic uses for their data products. These observations suggest the following recommendations for NSOs and for the broader research community.
3.1. Recommendations for NSOs
3.1.1. Engage in Best Practices for Outreach to Researchers
Broadly, NSOs should strive to increase their understanding of the needs and perspectives of secondary users of official statistical data for analytic and causal inference. For example, researchers at Statistics Netherlands have explored some causal frameworks and applications (Pijpers and Boeters 2023). There are opportunities for NSOs to recognize and build upon the connections between surveys and causal inference with the goal of deepening their understanding of user needs and perspectives. There is a small but growing literature describing bridges between the two. Entry points for survey-focused statistical staff in NSOs might include Keiding and Louis (2016), Mercer et al. (2017), Rosenblum et al. (2019), and Yang and Kim (2020).
While an expansion of analytic inference knowledge in NSOs might be beneficial, it is primarily important for NSOs to understand the analytic and causal goals of their secondary users and the challenges those users face. This understanding will inform designers of strategies to facilitate sound secondary analysis. NSOs should track the uses of their official statistics across user groups and seek input from those groups whenever new design decisions are to be made.
One model for interaction with user groups is the U.S. Census Bureau’s adoption of the “Statistical Products First Approach” (Keller 2023), which inverts the typical model for official statistics data products. Rather than focusing on the data and then seeing what use-cases are possible, the idea is to first gather use-cases and develop a statistical product that supports those purposes and uses.
3.1.2. Create Secure Linkage Environments
A second approach that NSOs can use to promote analytic uses of their data products is to invite secondary analysts in as primary analysts. NSOs can create or leverage existing environments to foster secure linkage and data integration more broadly. As one example, the Foundations for Evidence-based Policymaking Act of 2018 (“Evidence Act”) in the U.S. aims to expand the use of data by agencies and other stakeholders for policy- and decision-making. Among its directives are to enhance data sharing across agencies and for agencies to develop Learning Agendas that outline high-priority evaluation questions along with descriptive and analytic tasks. A demonstration project is underway to inform the design of a National Secure Data Service (NSDS) that would facilitate the data sharing and secure data linkages required for such evidence-building. The NSDS aims to broaden data availability for a wider user community relative to that served by Federal Statistical Research Data Centers (FSRDCs). FSRDCs have long been used by academics to access and analyze sensitive data for analytic purposes, so an NSDS would be a promising development for those interested in linking official statistical products to other data sources for analytic and causal inference.
3.1.3. Develop Best Practices for Dissemination of Metadata
To the extent possible, NSOs should provide detailed metadata for their official statistics, along with technical guidance for appropriate secondary analysis as user needs and experiences emerge. This dissemination of metadata can follow best practices as described in broad-based quality guidelines, such as the Federal Committee on Statistical Methodology (2020) Framework for Data Quality.
3.2. Opportunities for the Research Community
Naturally, there will continue to be opportunities for the research community to address new analytic questions and to develop new statistical models, new methods of integrating information from different data sources, new methods of quantifying uncertainty, and new privacy protections. The specific challenges of connecting analytic inference to official statistics lie in making those models and methods pragmatic.
Secondary analysts currently have readily available tools for many analytic models (categorical data analysis, linear and generalized linear models, survival analysis, etc.) in standard software using minimal design information (stratum identifiers, primary sampling unit identifiers, and weights; or replicate weights). The research community, including secondary analysts, needs to be engaged in creating similarly pragmatic statistical tools for analytic inference for official statistics with more complex structures, such as blended data of various types or with various types of privacy protections. Pragmatic tools will be those that use as little information as possible from the NSOs: how little is enough to yield defensible inferences?
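To make concrete how far the minimal design information above can go, the sketch below (with invented microdata) computes a weighted total and its "with-replacement" ultimate-cluster variance using only stratum identifiers, PSU identifiers, and weights, the same inputs standard survey software requires:

```python
from collections import defaultdict

# Sketch of the with-replacement ultimate-cluster variance estimator for
# a weighted total, using only stratum id, PSU id, and weight.
# All records are invented for illustration.

# (stratum, psu, weight, y)
sample = [
    (1, 1, 50.0, 4.0), (1, 1, 50.0, 6.0),
    (1, 2, 50.0, 5.0), (1, 2, 50.0, 7.0),
    (2, 1, 80.0, 2.0), (2, 1, 80.0, 3.0),
    (2, 2, 80.0, 4.0), (2, 2, 80.0, 1.0),
]

total_hat = sum(w * y for _, _, w, y in sample)

# Sum weighted y within each PSU, grouped by stratum.
psu_totals = defaultdict(lambda: defaultdict(float))
for h, psu, w, y in sample:
    psu_totals[h][psu] += w * y

var_hat = 0.0
for h, psus in psu_totals.items():
    totals = list(psus.values())
    n_h = len(totals)
    mean_h = sum(totals) / n_h
    # Between-PSU variability within each stratum drives the variance.
    var_hat += n_h / (n_h - 1) * sum((t - mean_h) ** 2 for t in totals)

print(total_hat, var_hat)
```

Extending equally spare interfaces to blended or privacy-protected data products is exactly the open question posed above: how little design information is enough?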
If NSOs are as transparent as possible (given privacy protections) in the release of official statistics and their metadata, and statistical researchers are as creative as possible in winnowing down the requirements for defensible inferences, secondary analysts should continue to have the data and the tools they need to address analytic inferential questions from official statistics.
Author Contributions
All three authors contributed equally to the conception and writing of this paper.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Received: January 11, 2025
Accepted: February 3, 2025
