Sharing digital trace data: Researchers’ challenges and needs

Abstract

Over the past decade, research has made rapid progress in the collection and analysis of digital trace data. However, when it comes to sharing data, researchers still face major barriers that often limit or prevent the reproducibility of research results and the reuse of data. Against this backdrop, we identify three broader categories of user challenges, namely researchers’ capacities & incentives, legal & ethical challenges, and technical hurdles. We describe in detail the problems researchers face in each category and why these often prevent researchers from sharing data, thus limiting both the reproducibility of research outputs and data reuse in other research projects. We conclude each category with specific needs of researchers for sharing digital trace data and making it reusable for others. These are intended to provide researchers as well as research institutes and repositories with approaches to improve the situation of data sharing.

Keywords

Digital trace data, data access, research data management, open science, reproducibility, data archives

Introduction

In this commentary, we address the challenges faced by researchers who have collected digital trace data (DTD) and want to share their data for reuse by other researchers (see GESIS, 2023 for a definition of DTD¹). Based on an extensive literature review, we identify three broader categories of challenges. Within these categories, we highlight the specific needs of users for sharing DTD and making it reusable. While some of the problems are known from other data types, sharing DTD poses distinct challenges, such as the personal and sensitive content of the data or the substantial size of average datasets (Breuer et al., 2020). Possible solutions include adapted metadata standards that contribute to the findability and comparability of DTD, or centralized collections of legal documents such as platform terms of service (ToS) that affect the distribution and reuse of data.

Researcher capacities and incentives

The first challenge is that sharing DTD requires knowledge and capacities. Making this type of data reusable is often resource-intensive beyond “just” preparing a dataset. Reusability for DTD often involves organizing, storing, and documenting data and code in a manner that enables other researchers to modify them according to their requirements (Van Atteveldt et al., 2019). In many cases, it is essential to share the code that was used to collect the data so that others can understand the dataset and use it for further research. A primary obstacle to promoting reusable code lies in the considerable time investment it requires. This work does not directly contribute to the developer's individual research, as software development, particularly in terms of maintenance and support, is frequently not recognized as a distinct academic contribution (Van Atteveldt et al., 2019).

Furthermore, the incentives for making data available are lower as there is no common citation standard for data, comparable to the lack of citation standards for software (Zenk-Möltgen et al., 2018). Additionally, this can vary significantly between different research communities. Embracing reusability practices already from the outset of a research project would be more time-efficient compared to the effort required to collect, clean, and document data and code after the submission has been accepted (Van Atteveldt et al., 2019). This would foster a research environment where considering data fairness becomes normal practice (Leonelli et al., 2021). Otherwise, researchers might not just shy away from investing additional resources into making data available but even fear negative consequences such as loss of rights over data and “first rights” loss (Akdeniz et al., 2023). This can lead to counterproductive outcomes since the sharing of high-quality datasets could mitigate the allocation of significant amounts of time, effort, and eventually research funding towards tasks that have already been carried out by others (Weller and Kinder-Kurlanda, 2015).

Acquiring the knowledge for sharing research data in a sound manner can seem costly even if this is not always the case. Data management plans, guidelines like the FAIR Principles, and standards for metadata and other types of documentation are “intended to facilitate data management and data sharing […] and ease the flow of data from researchers to permanent repositories” (Hemphill et al., 2021). However, they require knowledge that might be necessary neither for collecting nor for analyzing the data. This is especially apparent in the case of DTD, where a badly documented dataset might not be reusable at all. Even worse, subsequent users of the data could draw erroneous conclusions if the data is used without understanding the data-generating process.

Graphical interfaces can oversimplify the complexity of data, offering limited insight into the comprehensive nature of the downloaded content by presenting only a restricted set of variables (Sloan et al., 2020). It is crucial for researchers to acknowledge and disclose the limits of their understanding of the data when publishing. For instance, the processes by which data is sampled and provided through APIs are often unclear, and there is often a lack of understanding regarding systemic biases in social media usage. Institutions and platforms where social media research is conducted should promote an environment where their staff is encouraged to acknowledge and explore these issues (Leonelli et al., 2021). However, there is also a positive effect that comes with the nature of DTD: Data that is inherently “born-digital” facilitates more effortless and meaningful data sharing. Most analyses are guided by open-source scripts or tools that can be scripted, and the accessibility of free computing resources significantly enhances the convenience of sharing both data and code (Van Atteveldt et al., 2019).

What might support researchers in this endeavor and set incentives for sharing DTD? There will only be positive incentives to invest in the provision of high-quality data if the publication of data is of comparable importance to the publication of articles for the profile of a researcher (especially on the job market). Researchers who are willing to make their data available would benefit if the sharing of data collected in research projects were made an institutionalized default option. Journals, funding lines, and research institutes should enforce data sharing and only accept not making data available if there is no appropriate way of doing so. Furthermore, these institutions should treat data citation as good scientific practice in the same way as article citations. This is even more the case for software publications. Katz et al. (2024) describe possible approaches like the infrastructure initiatives that support the preservation, discovery, reuse, and attribution of software.

While analyzing data is an essential part of social science training, the sharing of data is often neglected. This aspect should be integrated into university curricula. However, trained researchers also require better guidelines and support with making data available for reuse, both from universities and especially from repositories with the respective expertise. This starts with the fundamental question of which services can be used to make data available and what types of data researchers can share. This question is not easy to answer at present, as many service organizations are not yet equipped to deal with the variety of DTD, both in technical terms and regarding the challenging legal and ethical issues outlined below.

Legal and ethical restrictions

While legal restrictions affect all parts of the research data cycle, sharing and re-using DTD is particularly affected (Bruns, 2019). As such data has only recently been used for research purposes, no legal practice has yet been established. In the absence of precedents, giving legal advice is often about risk evaluation and minimization. Legal restrictions for sharing digital behavioral data can touch upon different legal areas:

The first major legal challenge is the large amount of personal data. This makes informed consent either impossible or associated with high costs. Furthermore, in the European context GDPR imposes a purpose restriction for personal data which can be an obstacle for data re-use and sharing (Akdeniz et al., 2023). Additionally, it “requires following a principle of data minimization, […] a notion that is directly at odds with preserving data for secondary analyses” (Van Atteveldt et al., 2019). It is important to note that GDPR does not prohibit archiving and publishing data if appropriate data protection measures like access controls are in place (Breuer et al., 2021). However, researchers oftentimes need legal advice on how this applies to a specific dataset.

Second, platform-specific ToS can restrict researchers’ data access and sharing affecting the replicability of their research (Hemphill et al., 2021). The fact that ToS vary significantly between platforms and are changing over time makes the situation more complicated. This problem became particularly apparent with the shutdown of the academic Twitter API. While issues with using the API for data rehydration existed before the shutdown, due to variations in the platform's sampling strategy when data are requested, the current situation has rendered data access entirely unfeasible for many researchers (Assenmacher et al., 2022).

The last legal area concerns intellectual property rights. Especially social media users often incorporate third-party content into their posts. Given that copyright grants exclusive rights to creators for a limited time, intellectual property concerns may arise in social media data archiving (Breuer et al., 2021). This is particularly challenging when the data contains images or content from media outlets (Akdeniz et al., 2023). It is important to keep in mind that companies aim to protect their business assets, and freely sharing data poses risks, including competitors utilizing it or potential customers accessing it without compensation to the collecting company (Breuer et al., 2020).

While ethical challenges are often connected to legal challenges, they offer a different perspective and introduce additional challenges. Despite the growing number of studies, Taylor and Pagliari (2018) find “a deficit in ethical guidance for research involving data extracted from social media.” A challenge that is particularly pronounced for DTD is the threat from data linkage, even if sharing singular datasets might seem unproblematic: The increasing ability to link data implies that any entity in possession of data, particularly personal or sensitive information, faces a growing and challenging responsibility to prevent unauthorized linkage (Bishop and Gray, 2017). Common variables in social media or web tracking datasets, like time stamps or location, might be used to identify subjects when combined with data from other sources. For these reasons “careful consideration must be given to whether repository conditions match confidentiality commitments given to data subjects” (Bishop and Gray, 2017). Furthermore, as with other data types, it is important to ensure the utilization of suitable data management plans and to coordinate the collection of data in collaboration with the relevant institutional entities, such as the ethical committee or IRBs (Van Atteveldt et al., 2019).
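The linkage risk described above can be illustrated with a minimal sketch of coarsening time stamps and locations before a dataset is shared. The field names (`created_at`, `lat`, `lon`) and the chosen granularity are assumptions made for illustration only. Coarsening of this kind reduces, but does not by itself rule out, re-identification through linkage.

```python
from datetime import datetime


def coarsen_record(record, coord_decimals=1):
    """Return a copy of a record with less precise quasi-identifiers.

    The time stamp is reduced to the calendar date and coordinates are
    rounded. All field names are hypothetical.
    """
    coarse = dict(record)  # leave the original record untouched
    timestamp = datetime.fromisoformat(record["created_at"])
    coarse["created_at"] = timestamp.date().isoformat()
    for key in ("lat", "lon"):
        if record.get(key) is not None:
            coarse[key] = round(record[key], coord_decimals)
    return coarse


# Hypothetical example record
post = {
    "post_id": "a1",
    "created_at": "2023-05-17T14:32:08",
    "lat": 50.93753,
    "lon": 6.96028,
}
shared_post = coarsen_record(post)
print(shared_post)
```

Which level of coarsening is appropriate depends on the confidentiality commitments given to data subjects and on the access conditions of the repository.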

What measures would support researchers with legal and ethical challenges? As these areas are very complex, guidelines and training are needed to inform researchers about the legal and ethical aspects to be considered for different types of data. These materials need to contain information on when data can(not) be shared, what the possible restrictions or conditions for data sharing are, and when the legal situation is not clear. Another way forward might be “to develop a shared academic data license and verifiable standards and procedures for dealing with data” (Van Atteveldt et al., 2019). Furthermore, ethical boards and legal advice committees should be coordinated more closely to avoid situations where research teams from different institutions face conflicting obligations. A concrete measure regarding platform data is the collection of relevant ToS and their implications. Compared to the other legal areas mentioned above, information on the legal restrictions is platform dependent, changes rapidly, and might not be available if researchers did not save the ToS during the data collection. Golland et al. (2025), for example, assemble an archive of Twitter/X's policies for Tweet redistribution. This can serve as a blueprint for centralized guidance for researchers.

Despite growing awareness and training material for researchers on ethical data management in general, “training specific to big data, research ethics, and research integrity” is often still needed (Bishop and Gray, 2017). Guidelines for ethical sharing and preservation have only recently started to emerge. IRBs may set regulations that influence the reusability of collected data. However, ethical guidelines can vary significantly between research institutes (Assenmacher et al., 2022). In research teams with different affiliations, this can lead to a situation in which either multiple, competing ethical standards need to be followed or approval is only sought from one institution and might not meet the standards of the others. This is even more the case when different disciplines are involved. Thus, research institutes would benefit from closer cooperation and coordination of IRBs, setting common standards and drafting common guidelines.

Technical challenges

Apart from the legal and ethical challenges, there are also technical challenges. Although researchers are often open to sharing data upon request or making it available through repositories, they lack guidance on how to document their data. Currently, there is no established metadata standard for DTD (Hemphill et al., 2021). This is true both for the description of datasets and for the documentation of the collection process. What makes it even more difficult is that DTD are relevant for different disciplines and are documented differently depending on individual research interests. Due to the variations among social media platforms and the ever-evolving nature of the social media landscape, defining standardized elements proves challenging.

Furthermore, it is crucial to share and document the pre-processing code for datasets. This documentation should encompass detailed insights into data transformation processes, alongside information about the tools and methods employed. Additionally, researchers should provide the code utilized in data collection, particularly when APIs or web scraping methods are used. The documentation should cover details like search terms and applied filters (Breuer et al., 2021). Otherwise, researchers might neither be in the position to evaluate the analysis that has been conducted with the data nor re-use it for their own research. Finally, the magnitude in terms of both file size and dimensions as well as the speed at which DTD are generated is much higher compared to most other data types. The increased velocity raises concerns regarding the collection, updating, and versioning of data (Breuer et al., 2021). This increases the demands on both the technical infrastructure of researchers and that of repositories.
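As a minimal sketch of the documentation described above, the following Python example stores search terms, filters, tools, and pre-processing steps in a machine-readable file that accompanies a dataset, together with a checksum that identifies the exact version of the data. The class, its field names, and the example values are hypothetical and do not follow any established metadata standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CollectionRecord:
    """Hypothetical description of how a dataset was collected."""
    platform: str
    access_method: str  # e.g. "API" or "web scraping"
    tool: str           # software used for the collection
    search_terms: list = field(default_factory=list)
    filters: dict = field(default_factory=dict)
    preprocessing_steps: list = field(default_factory=list)


def build_sidecar(record, raw_bytes):
    """Serialize the record to JSON and add a SHA-256 checksum of the data."""
    document = asdict(record)
    # The checksum changes whenever the data file changes, which helps
    # to tell apart different versions of a dataset.
    document["sha256"] = hashlib.sha256(raw_bytes).hexdigest()
    return json.dumps(document, indent=2, sort_keys=True)


# Hypothetical example values
record = CollectionRecord(
    platform="ExamplePlatform",
    access_method="API",
    tool="hypothetical_collector 0.1",
    search_terms=["#election", "vote"],
    filters={"language": "en", "exclude_reposts": True},
    preprocessing_steps=["removed duplicates", "lowercased text"],
)
sidecar = build_sidecar(record, b"id,text\n1,hello\n")
print(sidecar)
```

In practice, the bytes would be read from the data file that is deposited, and the resulting JSON file would be shared alongside the collection and pre-processing code.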

What are possible solutions to these technical challenges? Infrastructures for data sharing need to provide user-friendly technologies that make data sharing accessible for researchers (Zenk-Möltgen et al., 2018). To support researchers, archives can provide best practice material and provide better information about the options for making data available. Furthermore, researchers need repositories that can provide technically appropriate access to large volumes of data. This means, for example, making large data sets available via APIs instead of (manual) downloads. Repositories also need to be able to handle sensitive data and use access control systems such as remote desktop access to give only authorized persons access to the data and ensure that it cannot be misused.

Moreover, researchers need consistent common standards to describe and prepare DTD for reuse with appropriate metadata and controlled vocabularies. Both researchers who have collected data and those who want to reuse this data need clear standards to be able to understand what data has been gathered, how it was obtained and for what types of research it can be used. This is the case for new DTD-specific descriptions like version number of the API and the computational methods used for data retrieval (Breuer et al., 2021). However, examining the applicability of established descriptions with metadata and controlled vocabularies is just as relevant. A frequent challenge in this context is the specification of a study period for a data set. While the survey period for survey data, for example, corresponds to the time period of the data, social media data may be collected via APIs long after its creation. A differentiation of social science metadata standards would prevent possible confusion here. In this context, it is important that universities and repositories comply with and build on existing standards for metadata descriptions like DDI (Akdeniz and Zenk-Möltgen, 2017).
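The distinction between the period in which content was created and the period in which it was collected can be sketched as follows. The field names (`created_at`, `retrieved_at`), the output keys, and the API version label are assumptions for illustration and are not elements of DDI or any other standard.

```python
from datetime import datetime


def _span(timestamps):
    """Return the first and last calendar day covered by ISO time stamps."""
    days = [datetime.fromisoformat(t).date() for t in timestamps]
    return {"start": min(days).isoformat(), "end": max(days).isoformat()}


def describe_periods(posts, api_version):
    """Report content period and collection period as separate entries."""
    return {
        # When the posts were created on the platform
        "content_period": _span(p["created_at"] for p in posts),
        # When the posts were retrieved by the researchers
        "collection_period": _span(p["retrieved_at"] for p in posts),
        "api_version": api_version,
    }


# Hypothetical example: posts from 2019 and 2020 retrieved in late 2022
posts = [
    {"created_at": "2019-03-02T08:15:00", "retrieved_at": "2022-11-20T10:00:00"},
    {"created_at": "2020-07-14T21:40:00", "retrieved_at": "2022-11-21T09:30:00"},
]
period_metadata = describe_periods(posts, api_version="v2")
print(period_metadata)
```

Keeping both periods in the metadata lets secondary users see that content may have been deleted or edited between its creation and its retrieval.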

Conclusion

This commentary began by examining the current challenges and requirements researchers face in sharing DTD. We identified three main categories of challenges: researcher capacities and incentives, legal and ethical restrictions, and technical challenges. While comprehensive solutions remain elusive, we summarize existing approaches that may help mitigate these challenges. These approaches involve contributions from researchers, archives, research institutes, and the wider scientific community.

Researchers have limited capacity to influence many of the structural barriers to data sharing. However, they exert direct control over research processes, including data collection, cleaning, and documentation. The early and consistent application of open science practices can substantially reduce the effort required for later data sharing. By ensuring transparency in data creation and processing, researchers facilitate informed reuse by others. Nevertheless, sharing data and code entails additional effort. To encourage researchers to engage in these practices, research institutes and funding bodies should establish appropriate incentives, such as integrating data-sharing achievements into existing reputation systems. Additionally, measures should be implemented to prevent competitive disadvantages for those who invest time and effort in making their research more accessible.

Beyond incentives, research institutions play a crucial role in supporting researchers by providing guidance, training, and information on data publication. This includes assistance in selecting and utilizing internal and external services capable of handling DTD, as well as support in navigating complex legal and ethical considerations. Addressing legal and ethical challenges requires institutional engagement as well. Ethics committees should be equipped to assess cases involving DTD in an informed and practice-oriented manner. Progress could be achieved through the development of standardized ethical guidelines, which would provide clearer guidance for researchers and facilitate cross-institutional and international collaboration.

Research data infrastructure providers must also adapt to evolving research practices and data types. Their role is to develop user-friendly services that enable efficient data sharing. This involves not only technological advancements capable of managing increasing data volumes but also the dissemination of best practices and guidance on legal and ethical aspects of data sharing. Moreover, infrastructure providers should contribute to the development of standardized metadata and ontologies for describing DTD, thereby enhancing data discoverability and interoperability.

In summary, overcoming the challenges associated with DTD sharing requires a concerted effort from multiple stakeholders. Researchers can improve transparency and facilitate reuse through rigorous documentation and adherence to open science principles. Institutions should create supportive environments by providing training, resources, and incentives. Ethics committees must work toward standardized guidelines, and infrastructure providers should develop robust, user-friendly solutions. Only through the combined efforts of these actors can the scientific community advance toward more transparent, ethical, and efficient data-sharing practices.

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Jan Schwalbach

Reiner Mauer

References

Akdeniz, Borschewski, Breuer, et al. (2023) Sharing social media data: The role of past experiences, attitudes, norms, and perceived behavioral control. Frontiers in Big Data 5: 971974.

Akdeniz and Zenk-Möltgen (2017) DDI-lifecycle at the data archive: The metadata schema for documentation in different software tools. (GESIS Papers, 2017/18). Cologne: GESIS - Leibniz-Institut für Sozialwissenschaften. https://doi.org/10.21241/ssoar.52487

Assenmacher, Weber, Preuss, et al. (2022) Benchmarking crisis in social media analytics: A solution for the data-sharing problem. Social Science Computer Review 40(6): 1496–1522.

Bishop and Gray (2017) Ethical challenges of publishing and sharing social media research data. In The Ethics of Online Research, vol. 2. Leeds: Emerald Publishing Limited, pp. 159–187.

Breuer, Bishop and Kinder-Kurlanda (2020) The practical and ethical challenges in acquiring and sharing digital trace data: Negotiating public-private partnerships. New Media & Society 22(11): 2058–2080.

Breuer, Borschewski, Bishop, et al. (2021) Archiving social media data: A guide for archivists and researchers. (1.2). Zenodo. https://doi.org/10.5281/zenodo.5041072.

Bruns (2019) After the ‘APIcalypse’: Social media platforms and their fight against critical scholarly research. Information, Communication & Society 22(11): 1544–1566.

GESIS (2023) Digital Behavioral Data. https://www.gesis.org/en/institute/digital-behavioral-data.

Golland, Recker and Schwalbach (2025) An archive and corpus of Twitter/X's policies for Tweet redistribution 2006-2023. GESIS, Köln. Datafile Version 2.0.0. https://doi.org/10.7802/2847.

Hemphill, Hedstrom and Leonard (2021) Saving social media data: Understanding data management practices among social media researchers and their implications for archives. Journal of the Association for Information Science and Technology 72(1): 97–109.

Katz, Gruenpeter, Vrouwenvelder, et al. (2024) Progress in including research software in the scholarly communication ecosystem. In: Year of Open Science Culminating Conference. Zenodo. https://doi.org/10.5281/zenodo.10849689

Leonelli, Lovell, Wheeler, et al. (2021) From FAIR data to fair data use: Methodological data fairness in health-related social media research. Big Data & Society 8(1): 20539517211010310.

Sloan, Jessop, Al Baghal, et al. (2020) Linking survey and Twitter data: Informed consent, disclosure, security, and archiving. Journal of Empirical Research on Human Research Ethics 15(1–2): 63–76.

Taylor and Pagliari (2018) Mining social media data: How are research sponsors and researchers addressing the ethical challenges? Research Ethics 14(2): 1–39.

Van Atteveldt, Strycharz, Trilling, et al. (2019) Computational communication science | Toward open computational communication science: A practical road map for reusable data and code. International Journal of Communication 13: 20.

Weller and Kinder-Kurlanda (2015) Uncovering the challenges in collection, sharing and documentation: The hidden data of social media research? Proceedings of the International AAAI Conference on Web and Social Media 9(4): 28–37.

Zenk-Möltgen, Akdeniz, Katsanidou, et al. (2018) Factors influencing the data sharing behavior of researchers in sociology and political science. Journal of Documentation 74(5): 1053–1073.