Abstract

1. Introduction
The joint use of multiple data sources to produce statistics is a long-standing practice in Official Statistics. The National Accounts, some indexes, and other “accounts-like” products have relied on integrating data from multiple sources for decades (e.g., see MacFeely et al. 2024). I will not discuss these mature Multisource statistics. Nor is there room to describe the full variety of methods and situations at hand, or to dive into the well-established survey theory research on utilizing multiple sources. For the former I refer to De Waal et al. (2020), and for the sample survey theory to Lohr and Raghunathan (2017) and Rao and Lohr (2025), who provide comprehensive overviews of recent developments and trends. Instead, my angle on the use of novel data in a multisource context encompasses the influence of two modernization approaches on the future directions and methodological innovation of official statistics, and especially how two different business arms of National Statistical Offices (NSOs) interact with these approaches. Opportunities are there to be seized from a deliberate and closer orientation between the classic end-to-end production arm disseminating statistical outputs, and the growing business arm providing data services with the NSOs as trusted stewards of integrated national data assets.
2. Two Modernization Approaches Influencing NSOs’ Uptake of Novel Data Sources and Multisource Statistics
2.1. The Exploratory Approach
One approach to modernization is characterized by an exploratory pursuit of alternative (sometimes unconventional) data sources. A driver for this is to leverage societal digitalization and its resulting data deluge. This approach has generated plenty of studies of data sources with the ambition to make new statistics. Often the new statistics appear as “experimental” or in study reports. A common feature is that the data are designed and collected by entities other than NSOs. These data sources may be labeled Big, Organic, Found, Administrative, or something else, and do not originate from controlled probability samples. Examples cover data from social media, scraped websites, cell phones, digital monetary transactions (from card companies and banks), various sensors, ship tracking, truck tachographs, satellites, registers, and more.
When acquiring these data, the NSOs have little or no influence on the data generation processes, and the data are almost never structured for statistical purposes. Therefore, substantial efforts are necessary to find out basics like: What do the data sources cover? What concepts are measured? And sometimes even, what type of unit does a record represent? These are considerable challenges and one reason why the exploratory approach to new data sources struggles to make major inroads in official statistics. Other reasons are that these data often have limited content, and they are frequently held by one or several private sector actors. Hence, any upfront advantage of getting masses of fast and cheap observations from these sources can easily be canceled out by thin content and complex (technical, legal, and relational) data supply pathways.
To improve the value proposition of an exploratory approach to novel data, NSOs should have a plan to use the data in a multisource statistics context from the beginning. Otherwise, the discovered data sources, once acquired, risk becoming monoliths which are expensive to sustain and have limited relative value. Advancing structured, basic data integration capabilities is essential. When investing in and building the supporting IT solutions, resources can be spared through some simple and inexpensive steps. This includes deciding early on a standard for statistical base units and base populations, and agreeing on the information models underpinning the data structures for making statistics (e.g., see Holmberg 2015; Holmberg and Bycroft 2017; Sundgren 2010).
2.2. The Problem-Oriented Approach
I describe the second and (so far) more successful modernization approach with novel data sources as problem-oriented. It can include all types of data mentioned above, but the intentions are not exploration of options and new applications to expand the scope of official statistics. Instead, the objectives are a mix of enhancing quality, reducing respondent burden and lowering production costs of existing statistics.
2.3. Problem Example 1: Improvement to Production Processes
A pertinent NSO problem is to adapt to the changed landscape for direct data collection. Acquiring sufficient and representative data from targeted populations has become harder and more expensive. Response rates (RR) in sample surveys, particularly for individuals and households, appear to be universally declining and, consequently, survey results are questioned. NSOs have made investments to improve their processes by using techniques that ease respondent workloads and encourage participation, for example, by introducing new collection modes and mixing them. This may have slowed down but not stopped the negative spiral for RR. However, by using additional data sources alongside collected data, NSOs can tackle the RR problem and its effects. In the survey design context, not all options have been exhausted for achieving modernization by introducing additional data sources as auxiliary data. If NSOs can set up efficient structures (auxiliary data repositories) that join up the business functions around data integration with new non-probability data sources and traditional direct collection, they can reap significant benefits. From a survey point of view, methods can be introduced and applied which are preventative (improving survey designs), reactive (alleviating and informing collection and processing steps), and corrective (decreasing bias with better estimation). How well this works boils down to the accessibility and strength of the auxiliary data. Note that this also applies to NSOs with advanced register systems.
The common conventional view is that directly collected survey data is the first (main) data source which is complemented and assisted by one or more additional sources. This is, if not straightforward, at least well studied and mature in a probability sample framing. The alternative reverse order, with the (non-probability) data source(s) first and a complementary probability sample designed thereafter, is newer. It has recently attracted, and will continue to attract, research. Overview descriptions and discussions of the principles for such multisource solutions are given by Citro (2014), Lothian et al. (2019), De Broe et al. (2021), and Coffey et al. (2024). Articles that cover some of the technical aspects of combining nonprobability data sources with probability samples include Beaumont (2020), Kim and Tam (2020), Rao (2021), Lenau et al. (2021), and Zhang (2021).
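The reversed order can be made concrete with a stylized estimator of a population total in the spirit of Kim and Tam (2020). The notation below is my own illustrative simplification, and it assumes the nonprobability source is fully observed and that source membership can be determined for every sampled unit:

```latex
% U: target population; B (subset of U): units covered by the nonprobability source
% A: probability sample from U with design weights w_i
% delta_i = 1 if unit i belongs to B, 0 otherwise
\hat{Y} \;=\; \sum_{i \in B} y_i \;+\; \sum_{i \in A} w_i \, (1 - \delta_i) \, y_i
```

The first term uses the nonprobability source directly for the units it covers; the second is a design-based sample estimate for the remainder of the population. The key practical requirements, determining membership for sampled units through linkage and trusting the source to be error-free, are exactly the kinds of conditions the cited articles examine.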
When considering using multiple sources to address an information need, my view is that NSOs should start with a data source agnostic approach. This lends itself well to applying extensions of the Total Survey Design/Total Survey Error frameworks (see Biemer 2010; Biemer et al. 2014; Groves and Lyberg 2010; Reid et al. 2017; Rocci et al. 2022; Zhang 2012). Although using the frameworks is largely a theoretical exercise, it is very practical in the sense that it means thinking before linking and not the reverse. Applying a Total Survey Design approach yields a systematic assessment of whether a multisource approach has benefits in the first place and, if so, in which order and how the data sources should be combined to achieve the best statistical outcomes with the available resources. If this is done diligently by considering trade-offs, capability constraints, and demands on technical investment, then the chances for a better overall outcome and more sustainable decisions increase.
2.4. Problem Example 2: Census Improvements
The modernization of Population and Housing Censuses is another area where NSOs have extensively researched effective use of multiple sources. Many countries are transitioning from a traditional census with enumerators and mailout forms to utilizing existing administrative data as the primary data sources. In the 2021 EU-Census, administrative data sources were the primary type of data source for most EU and EFTA countries. It is worth noting that since it is based on output harmonization, the EU-Census itself represents an exceptional Multisource statistics product. All countries rely on their own set of available administrative sources plus the option to collect new data in a conventional way. They then combine these sources individually in the “best” possible way to achieve the same agreed output goals.
The JOS special issue on Coverage Problems in Administrative Sources (Journal of Official Statistics 2015) and the paper by Dunne and Zhang (2024), with its discussions, illustrate some of the methodological advances for Censuses utilizing and relying on multiple sources. The papers also show a shift toward viewing outputs from a Census as results of an estimation process, rather than as an exercise of enumerating, counting, and compiling records. Since Census cycles are long, and since many important societal pillars such as election boundaries and socioeconomic decisions rely on them, any transformations and implementations of such shifts take time and will be done cautiously. For quality assurance, the theoretical frameworks of estimation must be applied, but applying the Total Survey Design paradigm is useful in this context as well.
2.5. Other Problem-Oriented Applications
In addition to the classic problems mentioned above, NSOs also look to Multisource solutions for analytical problems and as an option when developing statistics to meet new demands such as environmental statistics. For the analytical problems, data linking is usually customized for each research project. New data assets with multiple data sources are then established for research to inform policymaking. The linkages are typically done at the unit level in safe microdata environments or in statistical register systems designed for statistical and research purposes.
This role as data stewards and providers of data integration and sharing services to other users has grown massively for NSOs over the past twenty-five to thirty years.
Although statistics production, mostly determined by regulations and government funding with clearly defined content and quality requirements, still dominates, these data services play a significant role of their own. For most NSOs these services have become a second business arm, in addition to the production and dissemination of predominantly descriptive statistics.
3. Future Prospects: Aligning the Two Business Arms Yields Improved Multisource Design Opportunities
As mentioned, NSOs can improve their cost-benefit ratio by showing foresight and adopting a Multisource mindset for data structures and study design early. By planning for a second life and sensible, safe reuse of data in environments with systemized and standardized integration capabilities, NSOs can enhance their role in the data ecosystems and for the statistics they produce.
Given the cost and RR challenges, it is clear that modernized official statistics can no longer rely solely on their own collected data. However, if a quick response to sudden demands is needed, then other non-probability sources are often poor alternatives because of their rigidity in content. Relevant data may exist across several domains and there must be effective ways to integrate them for a multisource analysis.
To meet new demands, the ideal survey portfolio for NSOs would take a systems perspective where content is synchronized and where the “system” also includes the capability for surveys to complement content from non-probability sources in a customized manner. That is, when no relevant data exist for a problem, a solution is to have the capacity and capabilities to employ a multisource design collecting the missing parts. This system should include NSO infrastructure components such as standards, frameworks, and certain fundamentals like statistical base registers, geospatial assets, and population sampling frames used for statistics production.
With internal data security and confidentiality rules in place, both the traditional business arm responsible for end-to-end statistics production and the arm responsible for data services gain from aligning their multisource functions. The statistics production side can improve efficiency and content quality, while the data service analytical side becomes more responsive. Successful alignment also requires bridging cultural differences that sometimes exist between the business arms. It is not unusual for units working with maintenance of registers, register statistics, and data integration processes to be unfamiliar with usual statistics production practices, and vice versa.
3.1. Expected Demands on Methodological Development
In the near future, we both need and can expect more methodological research on novel data sources and multisource statistics. Here are some examples.
3.1.1. Statistics Quality
Quality frameworks exist, but they are not mature or complete for nonprobability sources and Multisource combinations (see e.g., Eurostat 2019). More real-world applications are needed to illustrate important design trade-offs between statistical quality and resource consumption. In a problem-oriented approach, I recommend a broadened Total Survey Design view to study errors versus cost alternatives for the Multisource options on offer.
3.1.2. Statistical Design, Sampling, Estimation, and Inference
Research on using nonprobability data sources, either alone or combined with probability sample sources, should continue. Understanding the coverage and representativeness questions is fundamental to assessing the value of non-probability data and the mitigations its use requires. Comparing these sources with a reference population over time can help benchmark their merit. Representativeness cannot simply be assumed; it should be verified through rigorous audit sampling methods (Zhang 2021). It is also expected that the underlying data processes of these sources will continue to be scrutinized to evaluate measurement errors and assumptions. When using more than two sources, the order of linking them is another research field, which in turn can be associated with the availability of identifiers and the levels and hierarchies of the unit records. Where the integration is best done, and at which level of unit aggregation, is not always a given.
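As a minimal sketch of the audit-sampling idea (the function, frame, and numbers below are invented for illustration, not taken from any cited method): draw a simple random sample from a reference frame and estimate how much of the frame a non-probability source actually covers, with a normal-approximation confidence interval.

```python
import math
import random

def estimate_coverage(frame_ids, source_ids, n_audit, seed=1):
    """Estimate the share of a reference frame covered by a
    non-probability source, via a simple random audit sample."""
    rng = random.Random(seed)  # fixed seed for a reproducible illustration
    source = set(source_ids)
    audit = rng.sample(list(frame_ids), n_audit)  # SRS without replacement
    covered = sum(uid in source for uid in audit)
    p_hat = covered / n_audit
    se = math.sqrt(p_hat * (1 - p_hat) / n_audit)  # SRS standard error
    ci = (max(0.0, p_hat - 1.96 * se), min(1.0, p_hat + 1.96 * se))
    return p_hat, ci

# Hypothetical frame of 10,000 units; the source covers 6,000 of them
frame = list(range(10_000))
source = frame[:6_000]
p_hat, ci = estimate_coverage(frame, source, n_audit=500)
```

Repeating such an audit over time gives the kind of benchmark against a reference population mentioned above, and tracking the estimated coverage rate reveals drift in the source.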
3.1.3. Improvements of Data and Statistics Processing
R&D on best practice methods for multiple sources in statistics production covers many areas. Examples include using auxiliary data to improve instruments and fieldwork in direct collection and automating processing steps. This includes ensuring the survey instruments capture concepts that, depending on objective, overlap, validate, or complement those of other data sources; enabling better fieldwork through timely, relevant paradata that steer adaptive design setups and propose suitable collection modes in near real-time; and effective editing throughout the process, where coherence is more important than before. Wagner et al. (2023) and the references therein illustrate some of this field of work. In addition, as illustrated by Zhang and Haraldsen (2022), Statistical Disclosure Control also needs attention in a new way, especially in integration phases.
3.1.4. Artificial Intelligence/Machine Learning (AI/ML) Applications
Data-dependent methods are candidates for the improved processing theme above. Examples include automating processing to improve data quality by detecting anomalies and inconsistencies within and between sources; automatic handling of coding and classification of incoming data to standard; autofill of information with the assistance of effective voice and image processing; and new uses of text processing, for example, discoverability on NSO websites and metadata for researchers, consistency checks of table-to-text output, detecting respondent fatigue, and flagging issues caused by poor measurement instruments. AI/ML could also be considered for linkage strategies. Research on uncertainty quantification for AI/ML is needed, such as well-designed resampling techniques to capture variability. The transparency of the methods is also a topic for finding good practice.
3.1.5. Graph Data Theory, Models and Applications
Graph databases have emerged as a powerful tool for storing and querying complex data relationships. Unlike traditional relational databases, graph databases handle interconnected data points, making them ideal for integrating multiple data sources and joining data semantically with graph linking. These databases then enable more sophisticated queries and analyses to identify patterns and relationships between different data sources, providing valuable insights that traditional methods cannot obtain easily. Research is needed here, too, on good estimation practice and uncertainty quantification. With semantically marked-up data, resampling methods are an option.
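Real graph databases add query languages, indexing, and persistence; as a standard-library-only sketch of the underlying linking idea (all source names and keys below are invented), records from different sources become nodes, shared-key matches become edges, and one integrated statistical unit is simply a connected component:

```python
from collections import defaultdict, deque

class LinkGraph:
    """A minimal in-memory graph: nodes are (source, record_id) pairs,
    edges represent 'same real-world unit' links found via shared keys."""

    def __init__(self):
        self.edges = defaultdict(set)

    def link(self, node_a, node_b):
        self.edges[node_a].add(node_b)
        self.edges[node_b].add(node_a)

    def cluster(self, start):
        """All records reachable from `start`: one integrated unit."""
        seen, queue = {start}, deque([start])
        while queue:  # breadth-first traversal
            node = queue.popleft()
            for nxt in self.edges[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        return seen

g = LinkGraph()
g.link(("tax", "p1"), ("register", "u42"))     # matched on a personal id
g.link(("register", "u42"), ("survey", "r7"))  # matched on an address key
```

Here `g.cluster(("tax", "p1"))` returns all three records as one unit, even though the tax and survey records share no key directly; this transitivity is precisely what makes graph linking powerful, and what makes linkage-error propagation a research question.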
3.1.6. Advanced Time and Space Dimension Capabilities
Modern data analysis critically involves the time and space dimensions. By incorporating time-based indicators of change, such as SUTSE models and nowcasting techniques, statisticians can gain more accurate and timely insights into trends and patterns. More research is needed for dynamic and responsive data analysis, particularly in rapidly changing environments. While NSOs may not need, or have the resourcing for, real-time processing of new data sources, similar methodologies can be useful in production situations. Building geospatial capabilities that go beyond data visualization on maps and use the space dimension for advanced analysis and integration is also an area of improvement for NSOs.
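SUTSE models are full multivariate state-space models and well beyond a short sketch, but the simplest form of the nowcasting idea, borrowing timeliness from an auxiliary indicator via least squares, fits in a few lines (the series below are invented, and a production nowcast would use a far richer model):

```python
def nowcast(target_history, indicator_history, indicator_now):
    """Nowcast the current, not-yet-published target value from a timely
    auxiliary indicator, via simple least-squares on past periods."""
    n = len(target_history)
    mx = sum(indicator_history) / n
    my = sum(target_history) / n
    # Ordinary least-squares slope and intercept on the historical pairs
    sxy = sum((x - mx) * (y - my) for x, y in zip(indicator_history, target_history))
    sxx = sum((x - mx) ** 2 for x in indicator_history)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept + slope * indicator_now

# Invented series: the target is published with a lag, the indicator is timely
estimate = nowcast([3.0, 5.0, 7.0, 9.0], [1.0, 2.0, 3.0, 4.0], indicator_now=5.0)
```

The same bivariate logic is what richer time-series machinery generalizes: SUTSE-type models additionally share trend and seasonal components between the series and deliver model-based uncertainty for the nowcast.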
4. Final Reflections
Methodological innovations are essential for addressing the challenges and maximizing the value of using multiple data sources for NSOs. They are also an enabler for classic statistics production to capitalize on the data steward business arm, and vice versa. As technology evolves, information models, AI/ML, integration methods, and tools such as graph databases will play an increasingly important role in improving statistical quality, managing effective statistical processes, and enriching responsiveness when analyzing complex data. Ultimately, success depends on NSOs’ ability to adapt to new methodologies and leverage emerging technologies.
Footnotes
Author’s Note
In this note I reflected on the title topic with a perspective on the “business” of official statistics. The viewpoints expressed are entirely mine and not those of my employer.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Received: March 30, 2025
Accepted: May 26, 2025
