Abstract
The FAIR data principles have emerged as a major focus in the world of scientific research data, but have not had as large an impact on official statistics. While there are good reasons for this, FAIR developments within the research community may be of interest to official statistical organizations. These include the increased availability of research data, improvements in the area of machine-actionable metadata, and a focus on provenance information which could lead to increased transparency and data quality. Some activities of interest are described as a starting point for those in official statistics who may wish to follow these developments.
Introduction
The FAIR data principles have had a major impact on how the scientific research community views the role of data. Instead of being a supporting asset, made available when required to validate research publications, data has become a primary output of the research process. Data is increasingly viewed as a potentially reusable resource – an asset resulting from the work of scientific research.
The same cannot be said of the impact of the FAIR data principles in the world of official statistics. Given the different mission and motivation driving the work of that community, FAIR may not seem as relevant to official statistics, and thus does not attract the same amount of attention. There are good reasons why official statistical organizations should pay attention to the FAIR data principles, however, even if those principles do not have the primacy they enjoy in the world of research: "FAIR data" and the attendant developments in the research community have much to offer official statistics in the pursuit of its own, different missions. This paper presents some of the arguments why FAIR, and the developments around data that it has prompted in the research community, may be of interest to official statistical organizations.
FAIR and the official statistics community
The FAIR Data Principles1
Wilkinson et al., "The FAIR Guiding Principles for scientific data management and stewardship", Nature: Scientific Data, March 2016.
GO FAIR Foundation.
One reason that FAIR has attracted so much attention is that it represents a shift in attitude toward data within the scientific research community. Science is driven by research findings, emphasizing publications rather than the data used to support those findings. Researchers and research organizations are rewarded for performing respected, high-impact research, and not primarily for the data they produce. While there are many exceptions, this is the general pattern we can see within the scientific community.
This shift can be explained, however, if we examine some aspects of how data has impacted modern research. In a world where many research topics are inherently cross-domain (climate change, urban sustainability, disaster risk and response, infectious disease, etc. – sometimes described as the "grand challenges"), the effort required to prepare data for analysis within large research projects involving many organizations and disciplines can consume as much as 80% of the project budget.3
EU Publications Office, "Cost of Not Having FAIR Research Data", March 2018.
We can see this difference in the relative maturity of official statistics in these areas: the level of attention and resources given to data management, data dissemination, and the metadata needed to support these activities is generally higher than in research settings. Further, because official statistical organizations often have major data reporting responsibilities and a broad user base (policymakers, students, journalists, businesses, etc.), there has been a long-standing focus on standardization, as we can see in collaborations such as the Statistical Data and Metadata Exchange (SDMX) initiative.
Although not literally true, it would be understandable to think that official statistics has “always been FAIR,” as the idea of making data broadly available in a useful form is not a new one. It is also true, however, that the FAIR phenomenon in the world of scientific research is something that can benefit the official statistics community, and which is worth paying attention to and participating in.
Many aspects of the data landscape have changed in the recent past, both for scientific researchers in some disciplines and for official statistics. The sources of data have assumed a greater variety, with survey data supplemented by data from administrative registers, business transactions, social media, and an increasing array of automated systems that collect data in order to function. We can see this in the social sciences, for example, where "computational social science" has received a lot of attention, and in the official statistics world, where the High-Level Group on the Modernization of Official Statistics (HLGMOS), based in the United Nations Economic Commission for Europe (UNECE), has addressed this theme.4
High-Level Group for the Modernization of Official Statistics (HLGMOS).
At the same time, the demand for data has grown: this pressure is felt by official statistics and scientific research infrastructures alike. "Big data" technology, based on massively scalable NoSQL databases and similar technologies, has enabled the development of analysis methods that can consume huge amounts of data without requiring the use of supercomputers. Various developments in artificial intelligence, such as the use of large language models, demand large bodies of data to function effectively. In many cases, official statistics are used to provide context for more specific data in scientific research, or are used to identify causal relationships to inform broad-based analysis. Ideally, these techniques consume not only the data, but also the attendant metadata.
This presents the producers and disseminators of data with a challenge: traditional methods of data documentation – often largely manual – are insufficient to support the growing amount of data and the demand for it. To provide sufficient metadata, production systems and software must capture or mine the metadata programmatically. In general terms, machines need fine-grained metadata to support the new analysis methods: documents describing data at a general level are not sufficient for direct consumption by machines. One example is how statistical classifications are published: where a human researcher can work from a PDF detailing the classification, a machine demands a format that is both machine-actionable and standard. In the scientific world, we see a parallel phenomenon around ontologies and controlled vocabularies of different types.
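To make the contrast concrete, the sketch below renders a small, invented fragment of a statistical classification as SKOS concepts in Turtle – the kind of machine-actionable output a production system could emit alongside a human-readable PDF. The codes, labels, and namespace are hypothetical, not drawn from any real classification.

```python
# Render a small, hypothetical classification fragment as SKOS (Turtle).
# The codes, labels, and namespace below are illustrative only.
CLASSIFICATION = [
    # (code, label, parent code or None)
    ("A", "Agriculture, forestry and fishing", None),
    ("A01", "Crop and animal production", "A"),
    ("A02", "Forestry and logging", "A"),
]

NS = "http://example.org/classification/"  # hypothetical namespace

def to_skos_turtle(items):
    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> .", ""]
    for code, label, parent in items:
        lines.append(f"<{NS}{code}> a skos:Concept ;")
        lines.append(f'    skos:notation "{code}" ;')
        # End the statement here unless a broader (parent) concept follows.
        lines.append(f'    skos:prefLabel "{label}"@en' + (" ;" if parent else " ."))
        if parent:
            lines.append(f"    skos:broader <{NS}{parent}> .")
    return "\n".join(lines)

print(to_skos_turtle(CLASSIFICATION))
```

A machine consuming this output can traverse the hierarchy via `skos:broader` and match codes via `skos:notation`, neither of which is possible with a PDF.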
This topic – the creation of granular, machine-actionable metadata – is of great interest within the FAIR community, and there are many ongoing developments that promise to help address the need. Among these are the work being done by CODATA and the International Union for the Scientific Study of Population (IUSSP) around FAIR vocabularies, and publications like the “Ten Simple Rules for making a Vocabulary FAIR”.5
Cox et al., "Ten Simple Rules for making a Vocabulary FAIR", PLOS Computational Biology, April 2021, doi: 10.1371/journal.pcbi.1009041.
IUSSP-CODATA Working Group on FAIR Vocabularies, "FAIR Vocabularies in Population Research", April 2023, doi: 10.5281/zenodo.7818156.
EuroVoc.
Food and Agriculture Organization, Caliper.
There is no easy solution to scaling up the documentation of metadata at a sufficiently granular level to meet increased demand, but there are many approaches being explored in the scientific research community which could be of benefit to the producers and users of official statistics as well. Collaboration here is in the interest of both communities.
One promise of the focus on FAIR in the scientific research community is the availability of significant amounts of research data in a more easily accessible form across many different domains. The emphasis within the FAIR community is on data and resource sharing, with both secondary use of data and reproducibility of findings in view. For the official statistical community, however, this may offer something different: a new source of data which can be used to support traditional production. There are some specific places where the increased availability of scientific research data might be useful, but they are not necessarily obvious, and there are some barriers to be overcome.
First, the technical standards used by the official statistical community are not always the same as those used in FAIR implementations in the scientific community, although there are some connections. SDMX is probably the most widely used technical standard for official statistics, but it is not used within research infrastructures: FAIR emphasizes RDF technologies, and SDMX does not, although the RDF Data Cube Vocabulary from W3C,9 which is based on the SDMX information model, provides one point of connection.
Data Cube Vocabulary.
Simple Knowledge Organization System (SKOS).
Extended Knowledge Organization System (XKOS).
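As an illustration of that connection, the sketch below expresses a single invented observation using the W3C RDF Data Cube Vocabulary, reusing the SDMX-derived dimension terms it builds on. The dataset URI, measure URI, and values are hypothetical.

```python
# One hypothetical observation expressed with the W3C RDF Data Cube
# Vocabulary (qb:). All URIs and values below are invented for illustration.
def observation_turtle(dataset, area, year, value):
    return "\n".join([
        "@prefix qb: <http://purl.org/linked-data/cube#> .",
        "@prefix sdmx-dim: <http://purl.org/linked-data/sdmx/2009/dimension#> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
        "",
        f"<{dataset}/obs/{area}-{year}> a qb:Observation ;",
        f"    qb:dataSet <{dataset}> ;",
        f'    sdmx-dim:refArea "{area}" ;',
        f'    sdmx-dim:refPeriod "{year}" ;',
        # A hypothetical measure property attached to the dataset's namespace.
        f'    <{dataset}#measure> "{value}"^^xsd:decimal .',
    ])

print(observation_turtle("http://example.org/ds/unemployment", "SE", 2023, 7.4))
```

The `sdmx-dim:refArea` and `sdmx-dim:refPeriod` properties come from the SDMX-RDF vocabularies published alongside the Data Cube specification, which is where the two standards families meet.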
While finding and accessing FAIR data from the scientific community promises to become easier, with a low cost in resources, the coverage of such data, and the methods used to produce it, also present some barriers, and affect how it can be used appropriately. Unlike official statistics, research data is often geographically localized, with a strong depth of focus on a particular phenomenon of interest. Methods are likewise oriented toward the research question under consideration, which may not align with the purposes of typical official statistical data collection. It should be noted that there are some exceptions, however: the European Social Survey, for example, covers the whole of Europe and is conducted as a repeated series looking at social attitudes and behaviors, and its data are documented using the DDI Lifecycle12 metadata standard.
DDI Lifecycle.
NUTS Glossary.
There is a role for more localized research data in official statistics, however, although using it may not be straightforward. Localized data can be employed to support quality checks, for example. In one case in Malawi – where the national data has been insufficient for understanding the impact of natural disasters and organizing responses at a local level – data collected by scientists studying public health can serve to baseline small area estimates, helping to improve the quality of data for some disaster-related purposes. Such techniques for enhancing data quality are not new in the official statistics world, but the availability of detailed microdata for employing them is increasing as a result of FAIR, and the technologies needed to make better use of this data source are improving as a result of AI techniques. (See Sam Clark's summary of his work.)
It is not yet clear how far the increased availability of scientific research data could benefit official statistics, but this is an area that is worth paying attention to. The COVID pandemic gave rise to many new data-sharing initiatives and platforms, and in general FAIR has emphasized the need for broad-based research infrastructure. In Europe, we see the European Open Science Cloud (EOSC) being heavily resourced; in Africa, we see data-sharing efforts in public health such as VODAN,14 the INSPIRE Network, and the African Open Science Platform.
VODAN.
INSPIRE Network.
African Open Science Platform.
Perhaps the single biggest impact that FAIR could have on official statistics is in the realm of data quality. The ideas of “data quality” in the two communities are very different, but some of the themes being pursued in FAIR are relevant to both sets of ideas. We will characterize quality as it has been approached in official statistics and in scientific research, and then look at how FAIR can impact these communities.
In the work around data quality, official statistics has traditionally focused on consistent data production over time. Reporting frameworks such as the IMF's Data Quality Assessment Framework (DQAF)17 and Eurostat's Single Integrated Metadata Structure (SIMS) reflect this emphasis.
Data Quality Assessment Framework (DQAF).
Single Integrated Metadata Structure (SIMS).
Scientific researchers have different definitions of quality. In some domains, the possibility of measurement error can be calculated, giving a very specific metric for data quality – accuracy – tied to the methods used within that domain. In more general terms, data quality can be understood as "fitness for purpose", that is, whether the data is useful for answering the research question being investigated. Thus, there is no single set of criteria for data, as data that is useful for one experiment may not be suitable for another: data quality, like beauty, is in the "eye of the beholder." The implication is that the amount and granularity of metadata becomes a primary aspect of data quality: you cannot pre-assess the data for any given purpose, but you can provide sufficient information to allow the potential user to perform their own assessment.
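A minimal sketch of what "fitness for purpose" implies in practice: given sufficiently rich metadata, each user can apply their own selection criteria to candidate datasets. The metadata fields and records below are invented for illustration.

```python
# A crude sketch of a fitness-for-purpose check over dataset metadata.
# The metadata fields and example records are invented for illustration.
datasets = [
    {"id": "ds1", "coverage": "EU", "years": (2010, 2022),
     "method": "probability sample", "unit_level": True},
    {"id": "ds2", "coverage": "one city", "years": (2021, 2021),
     "method": "web scraping", "unit_level": False},
]

def fit_for_purpose(meta, need_years, need_unit_level):
    """Return True if this dataset's metadata satisfies one user's criteria."""
    start, end = meta["years"]
    covers_period = start <= need_years[0] and end >= need_years[1]
    granular_enough = meta["unit_level"] or not need_unit_level
    return covers_period and granular_enough

# One researcher's purpose: unit-level data covering 2015-2020.
selected = [d["id"] for d in datasets
            if fit_for_purpose(d, (2015, 2020), need_unit_level=True)]
print(selected)  # → ['ds1']
```

The point is not the filter itself but its precondition: without granular, machine-actionable metadata (temporal coverage, method, unit of observation), no such assessment – by human or machine – is possible.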
This has led to a focus on provenance and data “context” in FAIR, which includes describing the sources of data, and the steps in their processing. A good example of this can be seen in the European Social Survey’s “ESS Labs – Climate Neutral and Smart Cities”.19
European Social Survey Labs.
We can think of this as a very comprehensive form of data documentation for the end user performing research, and it is; but it can also be understood from the perspective of transparency. These are not different sets of information: provenance is important both to transparency and to reuse. This notion of comprehensive data description, including rich provenance information, could help official statistics expand its idea of data quality in a similar direction, in line with ongoing discussions of this topic within the statistical community. From the perspective of standards/models, technical tools, and metadata, there is not a lot of difference in requirements, and it may be possible for the two communities to collaborate on the description of provenance for heightened data quality and transparency. Although they may use different terms for these concepts, there are fundamental similarities in the information they need and how they use it.
The case made above is that developments within the FAIR community may be of interest to those in official statistics, although more investigation is clearly needed to evaluate the value for any specific organization. Several developments within the FAIR community are mentioned above, but it can be difficult for people in the official statistics community to know where best to look for information and new developments. Below are some projects and activities which provide a starting point for those who wish to explore further. There is a wide range of activity in this area, so the list should not be taken as comprehensive.
WorldFAIR Cross-Domain Interoperability Framework (CDIF): The WorldFAIR initiative is an EU-funded project with a global scope. It looks at 11 case studies in different domains, exploring practical capabilities and requirements for FAIR implementation. FAIR is seen as operating within domains/scientific disciplines and also between and among them. CDIF is a minimum set of profiles of existing domain-neutral metadata standards, and common web-based technology approaches for implementing FAIR to support cross-domain exchange and reuse of resources. It is worth noting that the standards and models used in the official statistical community, such as SDMX,20 are relevant to this cross-domain work.
Statistical Data and Metadata Exchange (SDMX) Initiative.
As of this writing, the first draft of the CDIF guidelines has yet to be published; it is scheduled to be made available in the summer of 2024, with further development anticipated. Among the standards being recommended are DCAT,21 Schema.org, DDI-CDI, the PROV Ontology, ODRL, the Data Privacy Vocabulary, and the I-ADOPT Framework.
DCAT.
Schema.org.
DDI-CDI.
PROV Ontology.
ODRL.
Data Privacy Vocabulary.
I-ADOPT Framework.
WorldFAIR Project.
FAIR Impact29
FAIR Impact.
The European Open Science Cloud (EOSC) is a membership consortium organized to develop and support a pan-European research infrastructure across disciplines. The EOSC Portal30 provides a point of access to its resources and services.
EOSC Portal.
European Social Survey Labs.
EOSC Interoperability Framework.
EOSC Core Services.
It should be noted that all of the initiatives above have a degree of cross-participation among their staff and institutions, and make efforts to keep their work aligned. Notably, the various “interoperability frameworks” are not duplicative, but to a large degree address different aspects of interoperability. All of the initiatives mentioned are still ongoing, so it is difficult to say with certainty where they will eventually sit relative to one another, but they are not being conducted in isolation, nor are they competitors.
GO FAIR Foundation "FAIR Implementation Profiles" (FIPs) are a key tool for evaluating an infrastructure, domain, or large organization in terms of FAIRness. The GO FAIR Foundation has been a major force in the promotion of the FAIR data principles, and it has developed several tools that implementers may find useful. The FIPs are perhaps the most popular of these; you can learn more at the GO FAIR site.34
FAIR Implementation Profiles.
FIP Wizard.
IUSSP/CODATA Working Group on FAIR Vocabularies36
IUSSP-CODATA FAIR Vocabularies Working Group.
