Abstract
Eurostat is charged with providing high-quality statistics for Europe, and the current information landscape is making it increasingly challenging to do so. This paper presents the Eurostat dissemination approach (including the traditional dissemination vectors), and thereafter proceeds to present the recent initiatives to make European statistics data and metadata available in the form of Linked Open Data (LOD). After presenting some of the main challenges for open data dissemination (complete reproducibility, availability of high-quality LOD, capacity to consume LOD and achieving meaningful mashups between official statistics LOD and other data sources), it concludes by noting the potential of LOD to foster transparency, reproducibility, collaboration, interdisciplinary research driving scientific advancements, and contributing to a broader understanding of complex scientific challenges.
Introduction
Guiding principles for Eurostat dissemination
The European Statistical System (ESS) is the partnership between the statistical authority of the European Union, i.e. Eurostat (a Directorate-General of the European Commission), and the national statistical institutes (NSIs) of the European Union Member States [1]. Eurostat has various tasks [2], not the least in the field of dissemination, where the tasks include:
the dissemination of European statistics in accordance with the statistical principles of professional independence, impartiality, objectivity, reliability, statistical confidentiality and cost-effectiveness as defined in the Regulation on European Statistics [1] and as further elaborated in the European Statistics Code of Practice (ESCoP) [3];
ensuring that European statistics are made accessible to all users in accordance with statistical principles – and in this respect, providing the technical explanations and the support necessary for the use of European statistics.
Consequently, Eurostat’s overall mission is to provide high-quality statistics for Europe, and to this end, Eurostat develops and promotes standards, methods and procedures that allow the cost-effective production and dissemination of European statistics. It cooperates with international organisations in order to facilitate the global comparability of European statistics [4].
Trust in an era of disinformation
The spread of disinformation and the increasing sophistication of the phenomenon, the technological transformation, a growing demand for new statistics to measure societal phenomena, users’ changing habits and expectations have all led to a systematic reflection and changes in the way Eurostat and its partners in the ESS collect, produce, disseminate and communicate statistics to their users. Arguably, a major surge in public attention to the ‘fake news’ phenomenon took place in 2016, in connection with the Brexit referendum in the United Kingdom and the presidential election in the United States. Coincidentally (or presciently), the Digital communication, User analytics and Innovative products (DIGICOM) project was conceived already in 2015 [5].
This project, which ran from 2016 to 2019 and brought together participants from nearly all countries of the ESS, included the dissemination and promotion of, and the communication of the value of, European statistics as a reliable basis for evidence-based decision-making and an unbiased picture of society [6]. Since the end of DIGICOM, these activities have been mainstreamed, with Eurostat’s main strategic goal [4] being to, in the context of growing disinformation, remain an independent and trusted point of reference for statistics and data on Europe, necessary for better decisions, policies and public debate in the European Union.
Open data – a necessary but not a sufficient condition
Eurostat has a long-standing tradition of open data dissemination, with all Eurostat data having been freely available online since 1 October 2004 [7]. However, the mere act of putting data in the public domain does not achieve Eurostat’s task of disseminating official statistics in line with the ESCoP [3]. Most notably, ESCoP principle 15 on accessibility and clarity sets out a number of additional requirements in terms of e.g. the presentation of statistics and the corresponding metadata (including the methodology of statistical processes), the use of modern information and communication technology, methods, platforms and open data standards. Principle 15 also covers access to microdata for researchers (for which the DIGICOM project included a specific strand).
The DIGICOM project was also a frontrunner in terms of modern technology and open data standards, with a sizeable ‘Open Data Dissemination’ work package, which aimed to explore and identify solutions to a number of questions that arose with the prospect of disseminating official statistics in a way that would give active users as much freedom as possible to create their own products. The benefit for official statistics producers was clear: active users of official statistics can also be seen as ‘redisseminators’ who would create tailor-made products and services for their users/clients using official statistics, thereby enhancing data quality by adding value to the statistical information supplied, and helping official statistics producers reach new audiences. Arguably, this could also (already through the increased use and perceived utility) be conducive to an increase in trust in official statistics.
The DIGICOM project did also lead to the identification of a number of concrete challenges [8] concerning Linked Open Data (LOD):
an increased need for documentation and/or standardisation to enhance sharing;
a number of gaps to fill for ensuring conceptual and syntactical interoperability in the ESS for LOD;
many of the existing ESS assets (data, metadata and documentation) needed further adaptation before they could be integrated and provide a seamless experience to users.
The objectives and challenges described in the European Data Strategy [9] – which emphasises the importance of open data in driving innovation, improving public services, and promoting transparency – echo those identified as part of DIGICOM. The strategy sets out several goals related to open data, including:
Ensuring that data are available for reuse: The strategy calls for making more data available for reuse, particularly in the public sector. This includes making data available under open formats and licenses that allow for reuse and redistribution.
Promoting data interoperability: The strategy emphasises the importance of data interoperability, which refers to the ability of different systems and applications to link and exchange data with each other. Interoperability can in turn help to promote the reuse and sharing of data.
As could be seen from Section 1.2.2, there are various unmet challenges which need to be tackled by Eurostat and its ESS partners in order to improve the adherence to ESCoP principle 15 and to contribute to the implementation of the European Data Strategy when it comes to tapping the full potential of open data. In this paper we will focus on recent and ongoing achievements in the field of Eurostat’s dissemination to meet those challenges, with an emphasis on LOD.
We start by presenting (in Section 2) the current Eurostat dissemination approach, including the Eurostat website, which is a major component therein (not least through the Eurostat data browser), and the key role it plays in the dissemination of European statistics. Thereafter, we widen the scope by presenting data.europa.eu (DEU) [10] – the central point of access to European open data from international, European Union, national, regional, local and geodata portals. This is done in Section 3, which focuses on the collaboration between Eurostat and the Publications Office of the European Union (OP) when it comes to populating DEU with quality European statistics metadata. Considering the high impact of European statistical classifications, and the various ongoing initiatives in improving their dissemination, we treat them separately in Section 4. We then proceed to present a number of challenges – many of them already being tackled by the official statistics community – in Section 5, and end with some concluding remarks in Section 6.
The Eurostat dissemination approach
Guided by the principles set out in 1.1, the Eurostat communication and dissemination strategy [11] defines the operational framework for ensuring that trustworthy European statistics are widely accessible to users and also well understood by anyone looking for reliable data on Europe. It describes the wide range of existing statistical products on offer, highlights the areas that will require further attention in the future, and lists the actions to be taken at different stages of the communication cycle.
Users in focus
When developing its products and services, Eurostat takes a number of steps to ensure its communication and dissemination actions are user-centered, starting from knowing the Eurostat users and their needs. To this end, the DIGICOM project featured a user analytics work package, which for instance rendered results [8] in the form of a typology of Eurostat users (user personas), guidelines on user analytics and usability guidelines.
Many of the good practices established in the project have been mainstreamed and integrated into the Eurostat approach, which is based on user behaviour analysis as well as user feedback obtained in several ways, from user interactions via the support system, social listening on Eurostat social media channels, to usability testing and user research.
Metadata as a prerequisite for useful official statistics
Official statistics data are of little use without the accompanying metadata. Metadata provide essential information about the data, enabling users to identify them, to understand the content of the data, to get information on how to access or download them and to assess the data quality. For maximum utility, they should be offered in both human-readable and machine-readable format.
Accordingly, principle 15 of the ESCoP states that data should always be made available with supporting metadata. Eurostat fully subscribes to this principle, and consistently integrates the appropriate metadata in all its vectors of dissemination [12]. This includes [13]:
structural metadata which are used to represent the structure of the dataset (dimensions, attributes, variables), as well as reference metadata that describe statistical concepts, methodologies used for the generation of data or evaluation of the quality.
Eurostat website
The full range of Eurostat’s products and services is provided on the Eurostat public website (ec.europa.eu/eurostat) free of charge, as has been the case since 2004 [7]. The Eurostat website was revamped in 2022, with the aim of facilitating data access with new and improved features and making it clearer and more user friendly as well as fully in line with the European Commission Web Accessibility Directive [14].
Eurostat data browser
The main tool for data dissemination is the data browser [15]. The data browser was launched in November 2022, and is the result of a multiannual project to overhaul the Eurostat dissemination chain. The data browser allows users to customise, visualise and extract statistical data in an easy and interactive way. The data browser gives users easy access to data and metadata and gives them control of their user experience by allowing them to easily customise data visualisations and save their favourite views for later use, download data in a wide range of formats, including Excel, SDMX 2.0 and 2.1, TSV and JSON-stat, easily share their datasets through bookmarks and social media, and more. New users benefit from a comprehensive online support system, which includes a ‘first visit’ guide [15].
Machine-to-machine access to Eurostat data
While the data browser offers a traditional (‘point and click’) user interface, more advanced users (including for instance research institutions and private enterprises active in the data economy) need to access data without unnecessary manual operations. Therefore, Eurostat offers an application programming interface (API) for data access [16]. Given the increasing use of the R programming language in official statistics, Eurostat has developed the
Open data dissemination of European statistics via data.europa.eu
Exposing European statistics datasets on data.europa.eu
In 2017, Eurostat started uploading the catalogue of its data on the EU Open Data Portal, a predecessor to DEU [10]. The catalogue includes a description of the Eurostat datasets in RDF (Resource Description Framework) format including links to the distributions or visualisation of the datasets, and their reference metadata in different formats (SDMX, CSV, TSV).
Today, over 8 000 Eurostat datasets are published on DEU – the main point of access for European open data that aims to improve access to open data, foster high-quality open data publication at all levels, and create impact through open data reuse – with Eurostat feeding the portal twice daily. It should be noted that the datasets themselves do not reside on the portal – DEU provides links to the data and serves as an entry point, allowing Eurostat datasets to be discovered in various ways.
Common vocabularies for describing European statistics datasets
Each description in DEU follows the specifications of the Data Catalogue vocabulary Application Profile (DCAT-AP) [18], which provides a common vocabulary for describing the resources in data catalogues, with the objective of enhancing data findability and promoting reusability.
While DCAT-AP provides specifications for describing any type of dataset from the public sector, StatDCAT-AP [19], which is an extension of DCAT-AP, is tailored to the description of statistical datasets. It provides a dissemination vocabulary for statistical open data, defining several additions to the DCAT-AP model that can be used to describe the structure of the statistical datasets such as the dimensions and attributes, units of measurement, quality annotations, the number of data series or the length of time series. To enrich the descriptions of European statistics datasets, Eurostat is currently working on the integration of StatDCAT-AP into DEU.
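To make the idea of a catalogue description more concrete, the following is a minimal sketch of how a dataset and its distributions could be expressed with terms from the DCAT and Dublin Core vocabularies, serialised in a JSON-LD style. It is an illustration only, not the description actually published on DEU: the URIs on example.org and the description text are hypothetical, and the format labels are given as plain strings for brevity, whereas DCAT-AP generally expects values from controlled vocabularies.

```python
import json

# Namespace prefixes for the DCAT and Dublin Core Terms vocabularies.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def describe_dataset(uri, title, description, distributions):
    """Build a JSON-LD style description of one dataset.

    `distributions` is a list of (format_label, access_url) pairs.
    """
    return {
        "@context": CONTEXT,
        "@id": uri,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dct:format": fmt,  # simplification: plain label, not a URI
                "dcat:accessURL": {"@id": url},
            }
            for fmt, url in distributions
        ],
    }


# Hypothetical URIs; only the dataset title is taken from the paper.
record = describe_dataset(
    uri="https://example.org/dataset/milk-powder",
    title="Production of milk powder",
    description="Illustrative description text.",
    distributions=[
        ("TSV", "https://example.org/data/milk-powder.tsv"),
        ("SDMX", "https://example.org/data/milk-powder.sdmx.xml"),
    ],
)
print(json.dumps(record, indent=2))
```

Note that the record only links to the distributions; as with DEU, the data themselves would stay with the publisher.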
Identification of all European statistics datasets through persistent identifiers
Persistent identifiers for dataset descriptions
Each dataset description in the data catalogue is identified by a unique persistent identifier (PID) that is both human- and machine-readable. For instance, the PID
While these PIDs are unique, they are specific to DEU. In contrast, Digital Object Identifiers (DOIs) are in common use, and hence more commonly recognised – in particular in the scientific community. A DOI is a specific type of PID and is composed of unique strings of characters used to permanently identify a digital asset – such as a dataset or a scientific article. They are often found on the internet in the form of a link which enables any potential user to reliably locate a digital asset.
Persistent identifiers for datasets in the form of DOIs
In February 2023, Eurostat started assigning DOIs to its datasets to permanently identify them. The datasets published by Eurostat are assigned DOIs in the unique namespace
As the official DOI registration agency for the institutions, bodies, offices and agencies of the European Union, the OP registers the DOIs and their metadata at DataCite [21], a nonprofit organisation that provides persistent identifiers for research data. DataCite has its own metadata schema [22], which offers core metadata properties chosen for an accurate and consistent identification of a resource for citation and discovery purposes. DataCite creates PIDs for its dataset descriptions in a consistent manner. For instance, the aforementioned DOI 10.2908/TAG00039 does, when appended to a common stem, constitute the PID of the DataCite description [23] of the European statistics dataset ‘Production of milk powder’ [20].
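As a small illustration of how a DOI can be handled programmatically, the sketch below splits a DOI into its prefix (the namespace) and its suffix, and turns it into a resolvable link. It assumes the general-purpose resolver at https://doi.org/, which is not necessarily the ‘common stem’ used for the DataCite descriptions mentioned above, and the syntax check is a deliberately simplified approximation of the DOI format.

```python
import re

# Simplified DOI syntax: a prefix starting with "10." followed by digits,
# then "/" and a non-empty suffix without whitespace.
DOI_PATTERN = re.compile(r"^(10\.\d{4,9})/(\S+)$")


def split_doi(doi):
    """Return (prefix, suffix) for a DOI, or raise ValueError."""
    match = DOI_PATTERN.match(doi.strip())
    if match is None:
        raise ValueError(f"not a DOI: {doi!r}")
    return match.group(1), match.group(2)


def doi_to_url(doi):
    """Build a link via the general DOI resolver (assumed: doi.org)."""
    prefix, suffix = split_doi(doi)
    return f"https://doi.org/{prefix}/{suffix}"


# The DOI quoted in the text for 'Production of milk powder'.
print(split_doi("10.2908/TAG00039"))
print(doi_to_url("10.2908/TAG00039"))
```

Suffixes containing characters that are reserved in URLs would additionally need percent-encoding, which this sketch omits.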
Eurostat plans to assign DOIs to all European statistics datasets, and thereafter to add these DOIs to those dissemination vectors (see 2.3) for which this is practically feasible.
The advantages of using DOIs for European statistics datasets include the following:
Providing a unique and persistent identifier for datasets, which ensures that the data can be identified and accessed by other researchers, even if the data are moved to a new location.
Improving the discoverability of datasets.
Helping researchers track citations of a specific dataset and avoid citation errors, such as citing a different dataset with the same title.
Allowing researchers to cite any source data that they have reused or integrated (from multiple sources) into a new dataset.
Fostering consistency and interoperability across different data repositories and platforms.
Facilitating the tracking of usage metrics and analytics for research datasets, which can provide insights into how the data are being used and shared by the research community.
To summarise, the use of DOIs is essential in ensuring that research datasets are easily accessible, discoverable, and citable, which in turn helps to facilitate the advancement of science and innovation.
The key role of classifications for unlocking the potential of LOD
When coupled with the appropriate metadata architecture, metadata have the potential to improve findability, accessibility, storage, preservation, analysis, comparison, reproducibility, inconsistency identification, correct interpretation, visualisation, data linkage, assessment and ranking of the quality of data and avoiding unnecessary duplication of data [24].
Statistical classifications (used for standardising concepts in a statistical domain) constitute one key category of structural metadata, as they are necessary for the production of reliable, comparable and methodologically sound official statistics. As described by Hoffmann and Chamie ([25] p. 2), classifications group and organise information meaningfully and systematically into a standard format and involve an exhaustive and structured set of mutually exclusive and well-described categories. Therefore, they play a crucial role in organising, integrating, and leveraging the potential of LOD – enhancing, inter alia, data discovery and search as well as data integration, interoperability and comparability.
Eurostat has a high level of knowledge and experience in the development of classifications and is the custodian of several sectoral and transversal European statistical classifications used to produce European statistics [26]. Eurostat is also responsible for covering the European dimension of the international statistical classifications (ISIC, CPC) that are reference classifications for European statistical classifications (NACE, CPA) under its responsibility. As illustrated in Part I of the NACE Rev. 2 introductory guidelines [27], each statistical classification typically exists in a statistical ecosystem, where it is normally interlinked with other classifications – either structurally, or by means of correspondence tables.
Streamlining the dissemination via the Euro SDMX registry
Since the early 2000s and until 2023, Eurostat disseminated the statistical classifications used for the production of European statistics (as well as relevant correspondence tables involving those classifications) via the ‘Eurostat Reference and Management of Nomenclatures’ platform (RAMON). Once the decision was taken to phase out RAMON, Eurostat seized the opportunity to streamline and modernise the way in which statistical classifications are disseminated.
One of the ways in which the dissemination of classifications was upgraded was via the Euro SDMX Registry [28]. This registry is the Eurostat implementation of the SDMX Registry specifications as published by the Statistical Data and Metadata Exchange (SDMX) initiative. To streamline its dissemination of statistical classifications, Eurostat has converted all the classifications previously available in RAMON into SDMX/XML format and is now disseminating them via the Euro SDMX Registry.
Dissemination of statistical classifications as LOD
Eurostat also pursued a second approach by converting the main classifications used for the production of European statistics into RDF format and exposing them as LOD in EU Vocabularies [29] and in Cellar (the semantic repository of the OP) [30]. This was done with the aim of increasing data FAIRness (Findability, Accessibility, Interoperability and Reusability) in the ESS and beyond.
Formatting statistical classifications for LOD dissemination – from SDMX to RDF
Eurostat bases its second approach on the SDMX terminology, reinterpreted in the context of LOD. While there is no single formal RDF ontology that provides a full one-to-one equivalent for the SDMX Information Model, the most relevant ontology that can cover the modelling of statistical classifications is the Extended Knowledge Organization System (XKOS) [31] which is an extension (for representing statistical classifications) of the Simple Knowledge Organization System (SKOS) [32] that meets domain-relevant community standards and best practices. XKOS is derived from the generic statistical information model (GSIM) [33], a terminology and a conceptual model that defines the concepts relevant to structuring statistical classification metadata. In relation to the SDMX artefacts, XKOS has the added advantage of being compliant with the semantic web technologies and allowing a richer description of the resources, rendering them interoperable and machine-readable [34].
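To give an impression of what a classification looks like once expressed in RDF, the following sketch serialises a tiny two-level hierarchy to Turtle. It is a simplified illustration rather than the modelling actually used by Eurostat: it relies only on core SKOS terms (XKOS would add, for example, explicit classification levels), the namespace on example.org is hypothetical, and the two NACE Rev. 2 items serve purely as sample content.

```python
from dataclasses import dataclass
from typing import Optional

SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "https://example.org/classification/demo/"  # hypothetical namespace


@dataclass
class Item:
    code: str
    label: str
    parent: Optional[str] = None  # code of the parent item, if any


def to_turtle(items):
    """Serialise classification items as SKOS concepts in Turtle.

    Assumes codes are valid Turtle local names and labels contain
    no double quotes.
    """
    lines = [
        f"@prefix skos: <{SKOS}> .",
        f"@prefix ex: <{BASE}> .",
        "",
        "ex:scheme a skos:ConceptScheme .",
    ]
    for item in items:
        triples = [
            f"ex:{item.code} a skos:Concept",
            "    skos:inScheme ex:scheme",
            f'    skos:notation "{item.code}"',
            f'    skos:prefLabel "{item.label}"@en',
        ]
        if item.parent is not None:
            # skos:broader links an item to its parent in the hierarchy.
            triples.append(f"    skos:broader ex:{item.parent}")
        lines.append(" ;\n".join(triples) + " .")
    return "\n".join(lines)


items = [
    Item("A", "Agriculture, forestry and fishing"),
    Item("01", "Crop and animal production, hunting and related "
               "service activities", parent="A"),
]
turtle = to_turtle(items)
print(turtle)
```

Because every item receives its own URI, other datasets and correspondence tables can refer to it directly, which is what makes the classification linkable.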
Persistent identifiers for European statistical classifications
Eurostat LOD classifications are defined in the domain ‘data.europa.eu’, with one namespace assigned per classification (for example,
Technology stack for LOD dissemination of European statistical classifications
For the storage and dissemination of Eurostat classifications in RDF, a suite of four semantic platforms offered by the OP is used, building on three operational pillars:
reference data maintenance
visualisation
storing for sharing and re-use.
To summarise, these tools jointly provide a solid back-end for data owners (including Eurostat) to maintain and expose their data assets, a user-friendly front-end for users to discover and view the data assets, and the necessary infrastructure for programmatic access (to allow automatic re-use of these assets).
Reproducibility through blockchain technology
It should be noted that even with DOIs fully implemented for all European statistics datasets, there may be issues with reproducibility, since European statistics data currently are not versioned. For instance, whenever data are revised (for instance to replace preliminary data with final data or to correct for errors), previously disseminated data are ‘overwritten’. In such a case, a researcher or policy analyst trying to reproduce the results of a previous analysis will – even when using the exact same DOI and the exact same selection criteria – arrive at different results.
While this reproducibility issue could technically be resolved by taking a ‘snapshot’ of each disseminated Eurostat dataset whenever it is being updated, this would generate large volumes of near-identical data of little public utility. Eurostat is therefore currently considering using a more ‘lightweight’ approach based on blockchain technology [39]. The approach would essentially entail the following:
Eurostat injects ‘hashes’ (digital fingerprints) of each disseminated version of a dataset into a blockchain.
Any researcher or analyst A interested in ensuring the reproducibility of their results would have to download the Eurostat data that they use and (using their own infrastructure) save those data.
To credibly demonstrate the reproducibility of their findings, the researcher/analyst A would then (on top of sharing their code and the DOI of the data that they have used) also have to share the thus saved Eurostat data in unaltered form.
Although the data to which the DOI resolves might have been updated, any other researcher/analyst B wishing to reproduce the findings could then verify that the data used are indeed authentic by checking that the ‘hash’ of the data shared by A does appear in the ‘Eurostat blockchain’.
Thereby, all researchers and analysts wishing to achieve demonstrable reproducibility of their results would have the means to do so. Moreover, by researchers and analysts only saving the dataset versions underlying their analyses, there is no wasteful use of storage space for the various intermediate versions of data that nobody ever uses.
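The verification logic of this approach can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not the design under consideration at Eurostat: SHA-256 is assumed as the hash function, the blockchain is replaced by an in-memory append-only list (HashLedger is a hypothetical stand-in), and the DOI and data are invented for the example.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of a dataset version as a hex string."""
    return hashlib.sha256(data).hexdigest()


class HashLedger:
    """In-memory stand-in for the published record of dataset hashes.

    A real blockchain would add tamper-evidence and decentralisation;
    here only the publish/verify logic is illustrated.
    """

    def __init__(self):
        self._entries = []  # append-only list of (doi, digest) pairs

    def publish(self, doi, data):
        """Record the fingerprint of one disseminated dataset version."""
        digest = fingerprint(data)
        self._entries.append((doi, digest))
        return digest

    def is_authentic(self, doi, data):
        """Check whether `data` matches any version published for `doi`."""
        return (doi, fingerprint(data)) in self._entries


doi = "10.0000/EXAMPLE"  # hypothetical DOI
preliminary = b"geo\ttime\tvalue\nXX\t2020\t1.0\n"
revised = b"geo\ttime\tvalue\nXX\t2020\t1.2\n"

ledger = HashLedger()
ledger.publish(doi, preliminary)  # first dissemination
ledger.publish(doi, revised)      # revision overwrites the public data

saved_copy = preliminary          # what researcher A stored and shares
print(ledger.is_authentic(doi, saved_copy))          # B verifies A's copy
print(ledger.is_authentic(doi, saved_copy + b"x"))   # altered copy fails
```

Only the fingerprints are stored centrally, so the storage cost per dataset version is a fixed-size digest rather than a full snapshot.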
An important step towards new ESS capacity development and an increased quality of the open data dissemination by its NSIs was taken through the adoption of a list of high-value datasets (HVDs) for statistics [40]. Member States must disseminate these HVDs (i) free of charge, (ii) in machine-readable format, (iii) through APIs and (iv) as bulk downloads. To support national authorities in their dissemination of HVDs, guidelines [41] have been issued on how to use DCAT-AP [18] for a dataset that is subject to the requirements set out in the regulation [40] on HVDs.
While the ESS NSIs already do disseminate their statistics in machine-readable format for free, a number of NSIs do not yet expose their data via APIs or bulk download facilities. As dissemination of HVDs by Member States is mandatory, this will serve as an incentive for those NSIs of the ESS that do not yet have facilities for making their data available through APIs (and via bulk download) to develop their infrastructure. This could possibly be done through collaboration among NSIs to achieve a standardised approach and economies of scale. Once an infrastructure for disseminating statistics HVDs is in place, it will benefit the full range of products of the NSI, since it could be used for the dissemination of all their datasets – not just the HVDs.
Linked open data challenges for official statistics
Making statistical data available in RDF format
The transformation of existing data into LOD increases the opportunities for further collaboration between Eurostat and the NSIs of the ESS for developing, reusing, and linking reference and derived classifications. The main challenge for enabling this interoperability remains the availability of these statistical classifications in RDF. A successful interoperability use case is the availability of correspondence tables established between international and EU statistical classifications (NACE – ISIC, CPA – CPC), accessible remotely on EU Vocabularies [29] or on the Caliper platform [42], a project run by the Food and Agriculture Organization of the United Nations (FAO).
An international Community of Practice
Under the auspices of the ESS Standards Working Group, the LOD Community of Practice (LOD CoP) was launched in April 2023. This initiative, coordinated by Eurostat, includes ten ESS NSIs (the Netherlands, Latvia, Croatia, Hungary, Spain, France, Italy, Finland, Denmark, and Norway), Statistics Canada and the FAO. The LOD CoP is developing use cases and recommendations for:
linking structural metadata to statistical datasets;
linking statistical classifications;
defining specifications for a common API for retrieving classifications and correspondence tables;
linking statistical datasets across data catalogues.
It is not worth investing resources in LOD assets if they are ultimately not used by at least some categories of official statistics users. A part of the reason potentially hampering the use of LOD may be that there are not enough relevant datasets around – so the various initiatives already achieved and underway (as described in 4 and 5.3.1) are a crucial first step.
However, another hurdle to overcome before there could be greater uptake of LOD in the official statistics user community concerns the difficulty of consuming them. Zeginis et al. [43] suggest that ‘to unleash the full potential of [LOD] we need to facilitate the interaction with [LOD] and hide most of the complexity’. As an example, even official statistics users with considerable IT skills might struggle with non-traditional query languages such as SPARQL – if they do have experience in query languages, it would typically be those of the ‘SQL’ variety. Some initiatives are already underway to remedy this.
First, Eurostat has developed an R package for automatically generating or updating candidate correspondence tables between two classifications. As described by Karlberg et al. [44], this package is currently being extended to facilitate data ingestion through a function directly accessing classifications and correspondence tables data via SPARQL Endpoint APIs (Cellar of the OP [30] or Caliper of the FAO [42]). Apart from meeting the most pressing needs of official statistics users (‘getting the data’), it also serves a didactic purpose: the SPARQL code used to retrieve the classifications (and correspondence tables) is also returned by the function – thus allowing users to see what SPARQL code of relevance to them looks like. Ideally, this will allow official statistics users with good general coding skills to figure out how to tweak the SPARQL code themselves so that they can apply it for other purposes.
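For readers unfamiliar with SPARQL, the sketch below shows what a query retrieving the codes and labels of a classification might look like. It is written in Python rather than R and is not the code of the package described above; the function name and the scheme URI are hypothetical, and the query assumes that the classification is modelled with the standard SKOS properties skos:inScheme, skos:notation and skos:prefLabel. In the same didactic spirit, the function returns the query text itself, which could then be submitted to a SPARQL endpoint.

```python
def build_classification_query(scheme_uri, language="en", limit=None):
    """Return SPARQL text listing the codes and labels of one scheme."""
    query = (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?code ?label WHERE {\n"
        f"  ?item skos:inScheme <{scheme_uri}> ;\n"
        "        skos:notation ?code ;\n"
        "        skos:prefLabel ?label .\n"
        # Keep only labels in the requested language.
        f'  FILTER (lang(?label) = "{language}")\n'
        "}\n"
        "ORDER BY ?code"
    )
    if limit is not None:
        query += f"\nLIMIT {int(limit)}"
    return query


# Hypothetical scheme URI, used for illustration only.
query = build_classification_query(
    "https://example.org/classification/demo/scheme", limit=5
)
print(query)
```

Users comfortable with SQL may recognise the overall shape (SELECT, WHERE, ORDER BY, LIMIT); the main difference is that the WHERE clause matches graph patterns instead of joining tables.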
Moreover, as part of the ongoing collaboration between Eurostat and the OP [45], an initiative has been launched to better tailor ShowVoc [37] to statistical classifications through improved formatting, adapted terminology (replacing ‘LOD jargon’ with terminology commonly used in the classification community) and relevant ergonomics.
Going beyond official statistics
While the initiatives described in this paper focus on official statistics, it has to be borne in mind that to truly unleash the full potential of LOD, a wider group of use cases should also be brought into the picture. In a crisis situation, a policymaker might wish to rapidly get dashboard-like information on a phenomenon from whatever source is available. In principle, LOD was conceived for situations like this, and is designed to allow ‘mashups’ of different sources, such as those envisaged by DiFranzo et al. [46].
Exposing official statistics data as LOD opens up this opportunity – but the organisations disseminating official statistics might need to reflect on the scope of their role therein. For instance: is it the role of the official statistics community to guide key users (policymakers and policy analysts) in their use (so that the mashups use quality data whenever available), or does our responsibility stop with putting the LOD ‘out there’?
Conclusions
Eurostat does, like the official statistics community in general, have a longstanding tradition of open data dissemination. However, just disseminating data is not enough. Section 1 discusses the communication and dissemination activities, challenges and actions required to ensure trust in official statistics. As outlined in Sections 3 and 4, Eurostat has recently taken considerable steps to add to its regular dissemination vectors (described in Section 2) by making key data assets available in LOD format. Linked Open Data can provide substantial support to scientific research by offering a wealth of interconnected and openly accessible data resources and formats.
Linked Open Data empower scientific researchers by providing a framework for accessing, integrating, and analysing interconnected data. They foster transparency, reproducibility, collaboration, and interdisciplinary research, driving scientific advancements and contributing to a broader understanding of complex scientific challenges. However, while LOD offer many benefits, several challenges still need to be addressed in order to create an enabling environment for exploiting their full potential. These include, but are not limited to, enhancing data quality, privacy, curation, full accessibility, interoperability, and sustainability.
Addressing these challenges requires a concerted effort from a range of stakeholders, including policymakers, data providers, and users of open data. As illustrated in Section 5, Eurostat is committed to continue working in this area in collaboration with the key actors at EU level, such as the OP, as well as partners in the ESS and worldwide. Thereby, European statistics will become even more widely accessible to anyone looking for reliable data on Europe – and (as discussed in Section 5.3.4) their interoperability and integration with other sources will become possible.
