Abstract
In recent years, the advancement of open science has made data sharing increasingly common practice. Data availability has clear merits for science because it opens up possibilities for reuse of data sets by others, leading to less redundancy, more efficiency, and more transparency. The ideal is for scientific data to be as open as possible and findable, accessible, interoperable, and reusable (FAIR). Parallel to this development, data-privacy guidelines have become more stringent, culminating in the General Data Protection Regulation (GDPR). Navigating the balance between protecting participants’ privacy and making one’s data set as open as possible can be challenging for researchers. In this article, we provide two worked examples with real data sets from the behavioral and social sciences on how to be as open as possible and as closed as necessary, with the goal of maximally facilitating science while minimizing the risk of participant identification.
Data collected from human participants are central to the behavioral and social sciences. Typically, researchers who work with quantitative methods go through the following workflow: generate hypotheses, design a study, collect data, test hypotheses, interpret the results, and publish a scientific article. After hypotheses have been generated and the study has been designed, the data-collection phase of the research cycle is often laborious and time-consuming. Academic studies are often publicly funded and thus indirectly paid for by society at large. Despite that, the scientific publication at the end of the research cycle, which is based on the collected data, typically contains summary statistics but not the underlying data. The inaccessibility of data in scientific publications hampers the process of building new knowledge in three important ways.
First, inaccessibility of data in scientific publications makes it difficult for reviewers to detect potential mistakes in the statistical analyses of a study, undermining the processes that are supposed to provide mechanisms for scientific self-correction. Mistakes in key statistical claims of scientific articles are common (e.g., erroneous claims of statistical significance, incomplete or even incorrect method descriptions, errors in the conducted data analyses, errors in the reporting of statistical results) and cannot be detected if the underlying data are not shared. Recently, Artner et al. (2021) attempted to reproduce 232 main statistical claims for which underlying raw data were available. After considerable trial and error, the authors were able to successfully reproduce approximately 70% of the statistical claims. This sizable proportion of nonreproducible results (30%) highlights the critical need for scientific articles to be accompanied by their underlying data.
Second, the lack of accessible data underlying scientific articles hinders other scientists from independently verifying and expanding on the study findings without directly contacting the authors. Towse et al. (2021) reported a prevalence of open research data in psychology of 4%. Even if efforts to contact the authors are made, such requests are often to no avail. For example, in a secondary data-analysis project, Wicherts et al. (2006) requested data from 249 studies that were published in journals that had signed the American Psychological Association (APA) Certification of Compliance With APA Ethical Principles, which stated that, in principle, data should be shared on request (APA, 2001, p. 396). Included in their request was the promise that the data would not be shared with third parties, the approval of the ethics committee, and a description of their study aim (which was to reanalyze the data sets). Ultimately, the authors were able to obtain data from only 64 of those studies (25.7%). Note that the availability of data on request was, even in 2006, considered good practice in many journals and often explicit policy. The main barriers researchers give for not sharing data are that data sharing is not common in their field, preparing data for sharing is too time-consuming, and they never learned to share data (Houtkoop et al., 2018).
Third, not sharing the underlying data is hugely inefficient from a scientific perspective. Research teams that do not collaborate with the original authors and have different but related research questions will have to collect data from scratch. If such research teams had the original data set available to them, they might be able to answer these questions without the need to collect new data. This duplication of effort wastes time and resources and delays the advancement of scientific knowledge. Given that academic studies are often publicly funded and thus indirectly paid for by society at large, there is a moral obligation to maximize the utility of the data collected. Access to original data can facilitate meta-analyses, systematic reviews, and the development of new methodologies, enhancing the overall robustness and reliability of scientific conclusions. By promoting data sharing, the scientific community can foster a more collaborative and efficient research environment, ultimately driving innovation and discovery at a much faster pace.
To enhance accessibility of underlying data in scientific publications, the implementation of the so-called FAIR (findable, accessible, interoperable, and reusable) principles (Wilkinson et al., 2016) and data de-identification (e.g., Portage COVID-19 Working Group et al., 2020) are crucial. Whereas the FAIR principles provide a structured approach to making data findable, accessible, interoperable, and reusable—ensuring that data sets are managed in a way that maximizes their utility and accessibility—data de-identification ensures protection of participants’ privacy. However, practical guidance on proper de-identification when sharing FAIR data is lacking. In this tutorial, we provide two worked examples from the behavioral and social sciences showing how researchers can implement the FAIR principles and de-identify their data before sharing the data underlying their scientific article.
Implementing the FAIR Principles
To address the challenges of reproducibility and reusability in research, Wilkinson et al. (2016) developed the FAIR guiding principles for scientific data management and stewardship. But what does FAIR data entail? Findable means that data are assigned a unique identifier and registered in a searchable database. Accessible means that data are retrievable by their identifier using a standardized protocol; this protocol is open, free, and universally implementable and allows for authentication and authorization; and metadata—a set of machine-readable data that provide structured information about a data set, such as title, authors, and keywords—are accessible even when data are no longer available. Interoperable means that the (meta)data are machine-readable (i.e., in a format that can be read through an electronic device for interpretation and manipulation by a computer). Interoperable data thus use a formal, shared, and broadly applicable language for knowledge representation and include potential references to other data. Finally, reusable means that data are accurately and richly described (i.e., accompanied by machine-readable metadata and human-readable data documentation that meet domain-relevant community standards) and include a standardized license to grant the public permission for data reuse under copyright law (e.g., creative commons licenses).
The FAIR principles have rapidly gained widespread attention. Funding agencies recognize the importance of the reusability of research data and state data availability as a condition for funding (e.g., the Dutch Research Council and the European Commission’s Horizon Europe program). These funding demands have accelerated adoption of the FAIR principles across research fields. Adams et al. (2023) developed checklists for researchers in seven different disciplines to provide guidance with respect to making data FAIR. The authors concluded that disciplinary needs vary, influenced by researchers’ familiarity with FAIR principles, the relative prevalence of small qualitative versus large quantitative data sets, the presence of commercially sensitive data, and institutional mandates. For the field of neuroscience, Behan et al. (2023) described the Brain-CODE platform for research data and how data-sharing activities on the platform align with FAIR principles. The platform features rich metadata, access on request, a limited number of accepted data formats, and colorful data-exploration dashboards that enhance FAIR compliance. In the context of mental-health research, Sadeh et al. (2023) provided a table with FAIR applications. Despite these advancements in various fields, a comprehensive and unified guideline for applying FAIR principles in the behavioral and social sciences remains absent.
A consequence of the wide adoption of the FAIR principles is that interpretations of what it means for data to be FAIR have diverged (Jacobsen et al., 2020). Given these diverging interpretations, in the present article we aim to clarify their application by presenting a detailed, step-by-step guide to the 15 FAIR principles outlined by Wilkinson et al. (2016) for the behavioral and social sciences, including practical examples and in-depth analyses.
Challenges in Sharing FAIR Data in the Behavioral and Social Sciences
The open-science movement and the introduction of the FAIR principles invited researchers to make their data openly available for reuse as widely and as early as possible (e.g., Huston et al., 2019; Janssen et al., 2012; Nosek et al., 2012; Nuijten, 2019; Ramachandran et al., 2021). However, there can be legitimate (legal and/or ethical) reasons to shield certain types of data. For instance, in the field of behavioral and social sciences, scientists aim to understand the interactions and factors that shape human behavior in response to their environment. Scientists have a moral obligation to protect the dignity, rights, and welfare of humans participating in their research (Grodin & Annas, 1996; World Medical Association, 2013; Nethics code, 2018). This duty of care translates into the way that researchers design scientific studies but should also be reflected in responsible reuse of the (personal) data collected from their participants.
In 2016, the General Data Protection Regulation (GDPR) was adopted in the European Union with the aim of protecting the rights and freedoms of individuals. Although the GDPR has been described as “the toughest privacy and security law in the world” (GDPR.eu, 2022), imposing obligations on organizations that process personal data, it also recognizes the importance of scientific research. Like other organizations, research institutes need to have a legal basis before processing personal data (e.g., consent or public interest; GDPR Article 6), be transparent to research participants, collect and process personal data for the communicated purpose only, and keep these data accurate and secure (GDPR Article 5). In contrast to nonscientific organizations, however, research institutes may store personal data for the purpose of archiving (and verification). Moreover, under strict conditions (GDPR Article 89), researchers may reuse personal data that were previously collected on a legal ground other than consent, because further processing for the purpose of scientific research is never considered incompatible with the initial purpose (GDPR Article 5[1][b]; GDPR Recital 50).
These exemptions to certain obligations of the GDPR for the purpose of scientific research make it possible to keep certain data available for verification purposes and sometimes even reuse while making sure that extra technical and organizational measures are in place to protect the personal data. These measures include pseudonymization, secure storage and sharing solutions, and possibly data-sharing agreements in case of sharing data between different organizations (GDPR Article 89[1]). From an ethical point of view, participants should be informed about the processing of personal data and give consent to sharing their pseudonymized data for scientific research before enrollment in a study, even when it is not legally required.
There is a middle ground between making data publicly available and completely blocking access. Several platforms (e.g., Zenodo [https://zenodo.org/], DANS data stations [https://dans.knaw.nl/en/data-stations/], and DataverseNL [https://dataverse.nl]) allow access to specific data sets on reasonable request. Even though these platforms enable responsible data sharing, restricted-access procedures require significant time, effort, and sometimes monetary costs from researchers and data-support teams. This challenge is particularly acute in the behavioral and social sciences, in which conditionally sharing data is not yet standardized practice.
Balancing Privacy and Research: Effective De-Identification Techniques for Human-Subject Data in Behavioral and Social Sciences
The main hurdle when making data publicly available while working with human-subject data in the behavioral and social sciences remains that researchers have the obligation, both morally and legally, to protect the rights and freedoms of their participants and to protect the data concerning them (GDPR, Article 1[2]). A solution that makes these data available while complying with moral codes and European privacy legislation is to make the data set anonymous before publishing. The GDPR defines anonymous data as information that does not relate to an identified or identifiable natural person. The processing and storing of anonymous data are not regulated by the GDPR: Once data are anonymized, they are no longer considered personal data, and therefore the data-protection principles no longer apply (GDPR, Recital 26). This means that anonymous data can be shared and reused without any of the conditions mentioned in the GDPR.
An anonymized data set should no longer include information that can directly and unequivocally identify an individual (i.e., direct identifiers), such as names and social-security numbers. However, researchers should also be careful about including indirect identifiers in their data sets, such as age, ZIP codes, gender, occupation, and place of residence. Although these elements may not directly reveal an individual’s identity on their own, they could potentially be used in conjunction with other variables to indirectly identify a person. For example, extra information about the participant pool can often be derived from a published research article (e.g., students from a particular university and cohort), which in combination with the published data could potentially identify certain individuals in the data set. Fung et al. (2010) provided an extensive summary and evaluation of different formal approaches toward data de-identification. Arguably, the ultimate test of whether de-identification has succeeded is whether any records can be reidentified. El Emam et al. (2011) documented known attempts at data reidentification and found an overall success rate of reidentification attacks of 26%. However, the authors cautioned that many of the successful reidentifications were performed on relatively small data sets.
Making data sets publicly available involves a careful de-identification process to protect the privacy of individuals whose data are included in the data set. Several frameworks exist to aid researchers in making informed decisions regarding data set de-identification. These frameworks offer diverse methodologies and guidelines, catering to specific data types, research goals, and privacy concerns, emphasizing the importance of selecting the most suitable approach for preserving privacy while facilitating research. For example, the Canadian de-identification guide (Portage COVID-19 Working Group et al., 2020) covers topics such as identifying and removing direct identifiers, evaluating indirect or quasi-identifiers based on risk and utility, assessing the sensitivity of nonidentifying variables, and considerations for de-identifying qualitative data. In addition, it briefly touches on de-identification considerations for social media, medical images, and genomics data.
The five-safes framework is embraced by several Trusted Research Environments in the United Kingdom, including Health Data Research UK and the National Institute for Health Research Design Service, and provides a useful structure for identifying which elements in a data set are potentially sensitive and which levels of pseudonymization/anonymization are possible. The framework was originally developed in 2003 and was conceived to assist decision-making about data usage that may be sensitive or confidential (http://fivesafes.org; Desai et al., 2016). Inspired by this framework, the Dutch coordination point for research data management (LCRDM) developed a matrix for the assessment of privacy risks with research data and for the determination of appropriate methods for risk management of data availability. The matrix identifies five different levels of data safety that differ in the level of anonymization (i.e., identifying information is fully removed) or pseudonymization (i.e., identifying information is replaced by pseudonyms and identifiable combinations of variables are accounted for) of the data (LCRDM, 2019). Table 1 provides a schematic overview of the five levels.
Five Levels of Pseudonymization/Anonymization
Note: The table shows a fictional participant as identified with their full name, patient number, email address, postal code, city of residence, date of birth, income, job description, and car details under the most “open” level (PS0: not pseudonymized). At PS1, pseudonymized at Level 1, the full name and email address are removed. At PS2, pseudonymized at Level 2, the patient number is replaced by a study participant number, postal code and city of residence are replaced by a general area, date of birth has become year of birth, the exact income is replaced by an income range, and the license plate of the car is removed. At PS3, pseudonymized at Level 3, the general area of living is replaced by the country, year of birth is replaced by an age range, the income bracket has widened, and the car’s brand name is replaced by “sports car.” At the final, fully anonymized level, the study participant number is also removed. Please be aware that Recital 26 of the General Data Protection Regulation states that pseudonymized data can still be considered personal data.
Source: Adapted from the Dutch coordination point for research data management (LCRDM, 2019).
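To make these levels concrete, consider a minimal R sketch of two generalization steps of the kind the table describes (the variable names and bracket boundaries are hypothetical):

# PS2: replace exact date of birth by year of birth
dat$year_of_birth <- as.integer(format(as.Date(dat$date_of_birth), "%Y"))
dat$date_of_birth <- NULL

# PS2/PS3: replace exact income by an income range
dat$income_range <- cut(dat$income,
                        breaks = c(0, 25000, 50000, 100000, Inf),
                        labels = c("<25k", "25k-50k", "50k-100k", ">100k"))
dat$income <- NULL

Wider brackets (e.g., collapsing to only two income ranges) correspond to moving from PS2 toward PS3.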
As these de-identification frameworks show, a qualitative approach is necessary to identify and classify variables in terms of direct or indirect identifiers. After removing direct identifiers and deciding on the quasi-identifiers that are present in a data set, quantitative approaches for evaluating reidentification risk can be used to help assess whether a data set has been sufficiently anonymized. For example, the privacy model k-anonymity ensures that each record in a data set is indistinguishable from at least k – 1 other records with respect to certain attributes (El Emam & Dankar, 2008; for an accessible introduction, see Morehouse et al., 2024). The “k” value, an integer chosen by the researcher (e.g., 2), thus denotes the minimum number of records with identical quasi-identifiers, forming an equivalence class. The key principle is that within an equivalence class, it should be impossible to distinguish one record from the others. There are different variants of this model (e.g., l-diversity, t-closeness, and b-likeness), and SPSS and Stata code is available in Appendix 1 of the Portage COVID-19 Working Group et al. (2020) guide to create equivalence classes based on the quasi-identifiers in the data set and to list them by size. There are packages available for R (e.g., the sdcMicro package), open-source online applications (e.g., MinBlur and MinBlurLite; Morehouse et al., 2024), and point-and-click software tools that support a wide variety of privacy and risk models, such as ARX (https://arx.deidentifier.org/) and AMNESIA (https://amnesia.openaire.eu/). The output of these pseudonymization tools (i.e., k-anonymity, l-diversity, t-closeness, b-likeness) depends heavily on their input: which variables are marked as potential indirect or quasi-identifiers.
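To illustrate, the core computation behind k-anonymity fits in a few lines of base R (assuming a data frame dat with hypothetical quasi-identifiers age_range, gender, and region):

# Count how many records share each combination of quasi-identifiers;
# the smallest such count is the k for which the data set is k-anonymous.
quasi_ids <- c("age_range", "gender", "region")
class_size <- ave(rep(1, nrow(dat)), dat[quasi_ids], FUN = length)
k <- min(class_size)

# Records that are unique on the quasi-identifiers (equivalence class of 1)
# are the ones most at risk of being singled out:
dat[class_size == 1, ]

Dedicated tools such as sdcMicro perform the same bookkeeping and add suppression and generalization routines on top.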
Outline of the Present Tutorial
In this tutorial, we provide two worked examples from the behavioral and social sciences showing how researchers can de-identify their data and implement the FAIR principles before sharing the data underlying their scientific article. We address the 15 principles of data FAIRification using existing empirical data sets as examples. Given the uniqueness of each data set, universal guidelines are impractical. However, we posit that presenting representative scenarios and the decision-making processes involved can assist empirical researchers in making informed and responsible choices when sharing their anonymized FAIR data sets.
Throughout this article, we present a version of both data sets that includes all variables, including those that are removed during de-identification, so that the reader can see the difference between the preprocessed and postprocessed data. From an educational perspective, it would be ideal to present both data sets in their raw, completely unprocessed form so that the reader could compare them with their respective postprocessed versions. Unfortunately, such an approach would defeat the very purpose of this tutorial article because we would be making available data that are not properly de-identified.
The solution we have implemented in this article is to replace the recorded answers under the variables that are pseudonymized or removed in the de-identification process with simulated data. For our two example data sets, we make available both the full data set (with entries for sensitive variables that require de-identification replaced by simulated values) and the curated data set (with only the necessary but sufficient variables). Consequently, the preprocessed data that we make available for inspection are, in a sense, a hybrid data set: It contains all original variables, but only the variables that are retained in the postprocessed data still have the original entries as supplied by the participants. The use of simulation in this article serves educational purposes only; it is not employed as a tool for de-identification in the preparation of the postprocessed data sets.
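As an illustration, replacing sensitive entries with simulated values, as done for the educational hybrid data sets, could look as follows in R (hypothetical variables; we stress again that this serves display purposes only and is not a de-identification technique):

n <- nrow(dat)
dat$age    <- sample(18:70, n, replace = TRUE)            # overwrite real ages
dat$income <- round(rnorm(n, mean = 40000, sd = 12000))   # overwrite real incomes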
All preprocessed and postprocessed data sets, metadata, and related scripts are available at https://osf.io/eqbd3/. For both data sets, variables that were used in the original analyses are described in the Appendix.
The remainder of this article is organized in two parts. In Part 1, we integrate existing frameworks designed to assist researchers in making informed decisions regarding data-set de-identifications into a step-by-step de-identification process and demonstrate their application through two practical examples. In Part 2, we detail the process of rendering the data sets FAIR and open and offer our recommendations for evaluating the success of the FAIRification process.
Part 1: Step-by-Step De-Identification Process
The process of de-identifying data when preparing data sets for publication involves several steps. Building on the existing frameworks for de-identification, we present a step-by-step de-identification guide for the social sciences that covers the whole de-identification process based on the anonymization plan from Radboud University Nijmegen (van der Burgt et al., 2024), which is, in turn, based on a template from the Finnish Social Science Data Archive (Finnish Social Science Data Archive, 2024; see Box 1). Our de-identification guide is also available on https://osf.io/eqbd3/ as a stand-alone document.
Step 1: Describe your data set.
A. Describe your participant pool (What is your population? What was the sampling method? Rare phenomena?).
B. Describe the age of your data set (In what period did data collection take place?).
C. Distinguish between essential and optional variables (Which variables are used in the main analyses? Which variables were descriptive or collected for different studies?).
D. Identify and classify the identifiability of variables in your data set (Which variables are direct or indirect identifiers?).
E. Determine other sources of information that might influence the identifiability of individuals in your data set (Are there any third persons that might be aware of the people in your data set? Are there any other data sources available that might in combination identify an individual in your data set?).
Step 2: Determine the identifiability of your data set.
A. Weigh the perceived risk and utility of the variables in your data set using the five-safes framework.
B. Consult the local data steward and/or privacy officer to assess identifiability.
C. Determine whether it is useful to apply quantitative methods to quantify the identifiability of your data set (e.g., k-anonymity, t-closeness, l-diversity) and apply appropriate quantification methods.
Step 3: Design de-identification techniques for the variables in your data set.
A. Remove (direct) identifiers from your data set.
B. Define and apply appropriate de-identification techniques to variables in your data set in collaboration with the data steward and/or privacy officer.
Step 4: Go back to Step 2 until the data are sufficiently de-identified.
Step 5: Document the de-identification procedure and archive/publish this with the data.
De-Identifying and Publishing Data: Two Examples
In the following sections, we provide a brief description of the two empirical data sets used as examples in our study. These data sets originate from the fields of organizational psychology and experimental psychopathology; however, many aspects of the data treatment are relevant to other domains in the behavioral and social sciences.
Step 1: Describe Your Data Set
Step 1A: describe your participant pool
Data Set 1
This data set, from the field of organizational psychology, is about age differences in boundary management during telework. Employees of various ages completed two surveys in which they reported on their boundary-management behavior, well-being, and productivity, with the goal of (a) developing a new scale of boundary-management tactics specifically tailored to the context of telework and (b) examining the mediating role of boundary-management tactics for age differences in teleworkers’ work–life balance and productivity. To this end, a sample of employees ages 18 to 70 years, working at least 20 hr a week, teleworking part-time, and reporting English as their native language were recruited through Prolific Academic, an online platform for human data collection (https://app.prolific.co/). The study consisted of two surveys spaced 1 week apart in which participants were requested to give information on several demographic, personality, and work-related variables. More information about this study, which was recently published (Scheibe et al., 2024), including study materials, data, and Mplus codes, is available at https://osf.io/hvyzu/. A screenshot of the data set is shown in Figure 1.

Figure 1. First 15 rows and 14 columns of the full organizational-psychology data set.
Data Set 2
This data set from the field of experimental psychopathology describes a study employing a trauma-film paradigm (see James et al., 2016). In general, the employed method serves to study the consequences of exposure to aversive material in healthy participants. The study presented here had two objectives: (a) to replicate earlier findings by Verwoerd et al. (2011) on the role of cognitive control (i.e., resistance to proactive interference) and neuroticism in the development of intrusive memories of a potentially traumatic film and (b) to optimize the trauma film for use in a future replication of another study (Holmes et al., 2009). Participants were international first-year psychology students from the University of Groningen (a cohort of approximately 500 students) who participated in the study in exchange for course credit. More information about this study, along with materials, data, and its associated preregistration, is available at https://osf.io/9vtbz/. A screenshot of the data set is shown in Figure 2.

Figure 2. First 15 rows and 14 columns of the full experimental-psychopathology data set.
Step 1B: describe the age of your data set
Data Set 1
The data set on boundary management was collected during a lockdown period of the COVID-19 pandemic in 2021.
Data Set 2
The data set employing the trauma-film paradigm was collected in 2019.
Step 1C: distinguish between essential and optional variables
Data Set 1
A full description of all variables collected in the first data set is displayed in Table 2.
Variables Included in Original Data Set 1 (Names and Description) Before Making Data FAIR and De-Identified
Note: Red cells indicate direct identifiers, and orange cells indicate indirect identifiers. FAIR = findable, accessible, interoperable, and reusable; VBBA = Vragenlijst Beleving en Beoordeling van de Arbeid (translated: Questionnaire Experience and Assessment of Work).
Main and secondary analyses
An exploratory factor analysis was performed with the variables
Manipulation check
No variables functioned as a manipulation check.
Inclusion criteria
Participants were included or excluded based on the value of variable
Data Set 2
A full description of all variables collected in the second data set is displayed in Table 3.
Variables Included in Original Data Set 2 (Names and Description) Before Making Data FAIR and De-Identified
Note: Red cells indicate direct identifiers, and orange cells indicate indirect identifiers. FAIR = findable, accessible, interoperable, and reusable.
Main and secondary analyses
The study had two parts. Part 1 had a correlational design. The independent variables were
Manipulation check
This study included some variables that served to check whether the manipulations were successful. These included variables that assessed prefilm and postfilm mood (
Inclusion criteria
Participants were screened out depending on scores above clinical cutoff scores on variables
Step 1D: identify and classify the identifiability of variables in your data set
Here, we discuss the variables we classified as direct, indirect, or quasi-identifiers. For each variable, we carefully consider the risk each poses in terms of identification and the utility there is for publishing the variable as is (i.e., to not pseudonymize).
Direct identifiers, according to the GDPR, include specific types of information that can directly and unequivocally identify an individual. As a starting point, researchers could use the 18 protected-health-information identifiers listed in the U.S. Health Insurance Portability and Accountability Act (HIPAA): names; postal address information other than town or city, state, and ZIP code; dates related to an individual (e.g., date of birth); phone numbers; fax numbers; email addresses; social-security numbers; medical-record numbers; health-plan beneficiary numbers; account numbers; certificate/license numbers; vehicle identifiers and serial numbers, including license-plate numbers; device identifiers and serial numbers; URLs; IP addresses; biometric identifiers; full-face photographic images or comparable images; and any other unique identifying number, characteristic, or code. Note that although the GDPR and HIPAA both consider direct identifiers, the GDPR has a broader definition of personal data (i.e., any information directly related to an identified or identifiable natural person).
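For tabular data, dropping direct identifiers recorded as columns is straightforward; a minimal R sketch (the column names are hypothetical):

# Hypothetical direct identifiers in a raw survey export
direct_ids <- c("name", "email", "phone", "ip_address", "date_of_birth")

# Keep every column except the direct identifiers
dat_clean <- dat[, setdiff(names(dat), direct_ids), drop = FALSE]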
Under the GDPR, there are certain data elements known as indirect or quasi-identifiers. Although these elements may not directly reveal an individual’s identity on their own, they could potentially be used in conjunction with other data to indirectly identify a person. When these indirect identifiers are analyzed collectively or paired with additional information, they can lead to the reidentification of an individual. The GDPR recognizes the significance of safeguarding such information even if it may not be directly associated with a person’s identity. Examples are age, ZIP codes, gender, occupation, and place of residence.
Data Set 1
Direct identifier was
Data Set 2
Direct identifiers were
Step 1E: determine other sources of information that might influence the identifiability of individuals in your data set
Data Set 1
Future users of the data might be friends or family members of a participant, and they might be able to identify the participant based on time stamps or identification numbers (e.g., through the
Data Set 2
Future users of the data may be cohort members from university or staff members who may be able to identify the participant (e.g., through the
Step 2: Determine the Identifiability of Your Data Set
Step 2A: weigh the perceived risk and utility of the variables in your data set using the five-safes framework
When pseudonymizing variables, we used an adapted version of the LCRDM matrix based on the framework of the five levels of pseudonymization/anonymization (Table 1). In some cases, the same variable is treated differently in the two data sets, reflecting the fact that the risk of identification differs between the underlying populations. The step-by-step de-identification of both example data sets presented below is also available at https://osf.io/eqbd3/ as two stand-alone documents.
Data Set 1
The direct identifier
For
The variables
The variables
Data Set 2
For variables
The
The direct identifier
Although the variable
Although the variable
For variables
The variables
The variables P
Step 2B: consult the local data steward and/or privacy officer to assess identifiability
Once a selection is made of which variables to publish as is, which variables to de-identify, and which variables to exclude from publication, the next step is to assess the extent to which the de-identification has been successful. In the context of this article, we collaborated with our local data steward (M. de Jong, coauthor on this article) to assess the de-identification of both Data Sets 1 and 2, and we reached consensus that the two data sets can be made publicly available as per the description in Step 2A above.
Step 2C: determine whether it is useful to apply quantitative methods to quantify the identifiability of your data set (e.g., k-anonymity, t-closeness, l-diversity) and apply appropriate quantification methods
Formal metrics exist for quantifying levels of identifiability (see section “De-Identification Frameworks”), but each of these is heavily context dependent. Suitability of these metrics may be highest for data for which it is easy to find out which individuals are in the data set (e.g., data collected from very small populations with specific characteristics) or for data for which the sample reflects the entire population (e.g., data from all university students). In many research projects in the behavioral and social sciences, however, data sets comprise only a very small subsample of the population they reflect. Identifiability in these cases relates more to the sampling method (i.e., what is known about the population in relation to the sample) and to characteristics of participants that might be very unique in this population, making it easier to single out an individual. The two example data sets under consideration in this tutorial article fall into this category.
For most data sets collected in the behavioral and social sciences, we do not believe it is presently possible to circumvent human judgment completely. We recommend using quantification metrics only in collaboration with experts on de-identification, such as data stewards or privacy officers. They can provide advice on the usability of these metrics and on the levels of, for instance, k-anonymity that should be used when assessing the anonymity of your data set.
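For researchers who, together with their data steward, do decide to quantify identifiability, a minimal sketch using the sdcMicro package mentioned earlier (the quasi-identifier names and the k value are assumptions, not recommendations):

library(sdcMicro)

# Declare the quasi-identifiers and build an sdcMicro object
sdc <- createSdcObj(dat, keyVars = c("age_range", "gender", "region"))

print(sdc, type = "risk")          # summary of estimated reidentification risk
sdc <- kAnon(sdc, k = 2)           # local suppression until 2-anonymity holds
dat_anon <- extractManipData(sdc)  # retrieve the de-identified data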
Step 3: Design De-Identification Techniques for the Variables in Your Data Set
Step 3A: remove (direct) identifiers from your data set
For Data Set 1, we have removed variables
Step 3B: define and apply appropriate de-identification techniques to variables in your data set
For Data Set 1, we have pseudonymized variable
Step 4: Go Back to Step 2 Until the Data Are Sufficiently De-Identified
The procedure described above is likely to be iterative. For example, for Data Set 1, we had initially opted to remove the
Step 5: Document the De-Identification Procedure and Archive/Publish This With the Data
The final step is to document the results of the previous four steps. This will prove helpful in interpreting the data for future users. What was the context in which the data were collected? What decisions were made when publishing the data?
For Data Set 1, the results of these steps have been documented at https://osf.io/fz9pu. For Data Set 2, the results of these steps have been documented at https://osf.io/kq32a.
Part 2: Making Anonymized Data Publicly Available While Adhering to the FAIR Principles
Once a data set has been properly de-identified to the extent that it is anonymous, it can be made publicly available. An effective and comprehensive way of doing so is to make the data set FAIR. The publication that coined the FAIR acronym lists 15 principles that can serve as a guide when making data sets FAIR (Wilkinson et al., 2016). Some of these principles fit empirical data sets from the behavioral and social sciences better than others. In what follows, we address each of the principles using the framework by Jacobsen et al. (2020) as a guide and show how we implemented these principles in the two example data sets.
Several FAIR implementation communities have already defined FAIR implementation profiles (FIPs) to provide community standards for the implementation of the FAIR principles within and across research domains (Schultes et al., 2020). For the behavioral and social sciences, there are currently two FIPs available that can serve as guidance for implementing the FAIR principles (i.e., SSSR and EduSocDL; https://fip-wizard.ds-wizard.org). Before implementing new solutions for each of the principles, we assessed these existing practices and, when appropriate, adopted them based on their fit with the data sets in the current tutorial article to comply with community standards in the field of behavioral and social sciences.
Findable
Principle F1: (meta)data are assigned a globally unique and persistent identifier
To be findable, data sets should be assigned globally unique and persistent identifiers. Several types of persistent identifiers are used by researchers to make their data sets findable. Most data repositories provide the option to create a persistent identifier as soon as the data are published, in most cases a Digital Object Identifier (DOI). This option is also available in OSF, and therefore, a DOI is used as the persistent identifier for both data sets. Any object with a DOI can be found by appending the DOI to the string “https://doi.org/”; in this case, the associated URL is https://doi.org/10.17605/osf.io/eqbd3.
In the context of this tutorial article, we present two distinct empirical data sets, and for each data set, we incorporate both an unprocessed and a processed version. Because the focus of this article is on the ensemble of all data sets rather than any of these data sets in isolation, we associate the entire project page with a single DOI (both data sets are part of separate publications that are associated with OSF pages https://doi.org/10.17605/osf.io/hvyzu and https://doi.org/10.17605/osf.io/9vtbz, respectively). In addition, OSF enables researchers to produce unique links to refer to individual data files (e.g., the unique link for the curated version of the .csv file belonging to Data Set 1 is https://osf.io/xu653). However, in articles that present data from multiple data sets that may be of interest in isolation (e.g., in empirical articles that present multiple studies), associating each data set with a unique DOI likely makes more sense.
Principle F2: data are described with rich metadata
To increase the findability of a data set, it is also important to be able to find the data set based on the information it contains. Principle F2 serves this goal by requiring that rich metadata be added to the data. Metadata, in the context of data management and the FAIR principles, refers to structured, machine-readable information that describes various aspects of data. Metadata provide context and understanding about data, making them easier to discover, access, evaluate, and use.
Data repositories often provide guidance on the kind of project-level metadata that should be added when publishing data. Repositories either provide a standardized format to describe the (meta)data or offer several metadata schemata to choose from based on the needs of a specific research project. There are several domain-agnostic metadata schemas that are suitable for any data set that is published in a repository (e.g., Dublin Core, DataCite, Data Documentation Initiative, and OpenAIRE). In some fields, it is important to know certain information about the data set to determine whether it is useful for answering new research questions. For example, in the case of human-cognitive neuroscience, it is important to know additional information about the data set and study, such as the neuroimaging techniques (e.g., MRI or EEG), the preprocessing pipeline, and specifics regarding the experimental paradigm. At the time of writing, a general-psychology metadata standard is under development (Psych-DS; see https://github.com/psych-ds/psych-DS). A (preliminary) version of both the psychology and the human-cognitive-neuroscience standards can be used by researchers through OSF by clicking on the tab “Metadata” and then “Add Community Metadata Records” (https://osf.io/eqbd3/metadata/add).
Given that the psychology metadata standards were still under development when the data sets were processed, we have chosen to implement a domain-agnostic metadata standard. The data sets presented in the current tutorial article were published via OSF. Although OSF does not require researchers to describe the (meta)data in a standardized format, we have chosen to implement the Dublin Core metadata standard, which is one of the recommended metadata schemas of OSF.
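As an illustration of what such a record contains, here is a sketch that writes a minimal Dublin Core-style description of Data Set 1 to JSON from R (the field selection is ours; on OSF, the same information is entered through the metadata form rather than by hand):

library(jsonlite)

dc_record <- list(
  "dc:title"       = "Data Set 1: Boundary management during telework (curated)",
  "dc:type"        = "Dataset",
  "dc:date"        = "2021",
  "dc:identifier"  = "https://doi.org/10.17605/osf.io/eqbd3",
  "dc:rights"      = "CC0 1.0 Universal",
  "dc:description" = "Two-wave survey on boundary management during telework, de-identified"
)

writeLines(toJSON(dc_record, auto_unbox = TRUE, pretty = TRUE),
           "dataset1_metadata.json")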
Principle F3: metadata clearly and explicitly include the identifier of the data it describes
To make all research materials findable, it is important to connect the metadata with the data it describes. In our example, the DOI https://doi.org/10.17605/osf.io/eqbd3 links to a landing page that contains all (meta)data files and accompanying data documentation (i.e., general study information, codebook, and de-identification procedure) in one place. The persistent identifier should also be mentioned in the main article (e.g., in the method section and the data-availability statement), although this is not technically a FAIR issue. It may be preferable to have separate identifiers for different layers of (meta)data and associated data documentation, especially when there are many different associated files.
Principle F4: (meta)data are registered or indexed in a searchable resource
Findability increases when the data are deposited and indexed in a searchable resource, such as OSF, Zenodo, or DataverseNL. For the two data sets under consideration here, we have opted for OSF. OSF registers metadata of the landing page of a project with DataCite.
Accessible
Principle A1: (meta)data are retrievable by their identifier using a standardized communications protocol
Data(sets) should ideally be in the public domain and available to all without restrictions. A simple example is a web page (the standardized communications protocol being Hypertext Transfer Protocol, or HTTP). We made our two example data sets publicly available on OSF (https://osf.io/eqbd3), together with all the relevant documentation, such as the codebooks belonging to the data sets (e.g., https://osf.io/wqu9y/).
Subprinciple A1.1: the protocol is open, free, and universally implementable
A simple example is a web page (see above). The open HTTP makes the data easily accessible to researchers from different countries and communities.
Although this does not strictly fall under this subprinciple, we note here that for a data set to be accessible, it needs to be in a format that does not require proprietary software to open (https://en.wikipedia.org/wiki/List_of_open_file_formats). The current data sets are both .sav files, meaning that they require the paid program SPSS (but see the free alternative JASP; JASP Team, 2022). To convert .sav files into .csv files, we used R and the foreign package (https://cran.r-project.org/web/packages/foreign/; see also the example code at https://osf.io/eqbd3).
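The conversion itself is short; a minimal sketch of such a script (the actual code is on the OSF page; file names are abbreviated here):

library(foreign)

# Read the SPSS file, keeping value labels as factor levels
dat <- read.spss("dataset1_curated.sav", to.data.frame = TRUE,
                 use.value.labels = TRUE)

# Write an open, software-agnostic .csv version
write.csv(dat, "dataset1_curated.csv", row.names = FALSE)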
Subprinciple A1.2: the protocol allows for an authentication and authorization procedure when necessary
The “A” in FAIR does not necessarily mean open or free, but it does mean that researchers should provide the exact conditions under which the data are accessible. There may be good reasons not to disclose data and instead publish data sets under “restricted access” (e.g., privacy protection or ethical, legal, or commercial constraints). In such cases, the procedure for gaining access should be open (see Subprinciple A1.1). In case of privacy concerns, it is important to take extra measures to protect the rights and freedoms of participants in consultation with the local institutional review board. In addition, it is important to ensure that researchers who would like to reuse the data adhere to the terms of use. Authentication procedures can be set up to provide human access or machine access. An example of human access would be sending an email to an administrator who evaluates requests for data access on an individual basis. An example of machine access would be requiring an institutional login to gain access. Similar protocols are used when universities have (online) journal subscriptions because they allow researchers affiliated with these universities to access articles without having to pay for individual access. The institutional login serves as a proxy for having met the relevant access conditions.
Policies with regard to data sharing with access restrictions may vary across universities or institutes. Authentication procedures requiring machine access are likely difficult to implement for small-scale individual research projects. For example, the University of Groningen currently recommends using DataverseNL (which supports a connection to OSF) because the university can exercise control over the data and the access procedure in shared responsibility with the researcher, ensuring sustainable access protocols even after a researcher leaves the university or academia. Policies regarding publishing open anonymized data are less strict, giving researchers more freedom in choosing a repository. At the University of Groningen, important criteria are longevity of the platform, clear governance, and support for rich metadata.
Principle A2: metadata are accessible even when the data are no longer available
As mentioned under Principle F3, we recommend storing all (meta)data and accompanying documentation together under a single persistent identifier. This helps to ensure that metadata will remain available at a future date even if the data themselves are no longer accessible (e.g., because the informed consent specifies the data will be stored for only a finite period of time). The exception to this is the situation in which the project administrator chooses to remove or delete the entire project. Depending on the data repository, different removal protocols exist. In the case of OSF, all associated data files will be permanently deleted, and a project DOI will resolve to a page “that provides metadata about the removed file (file name, storage provider, if the deletion occurred on OSF or on an add-on service, name/GUID of user who deleted the file, and timestamp of file deletion)” (retrieved from help.osf.io on January 13, 2025).
Interoperable
Principle I1: (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
The heart of the interoperability principle lies in the concept of machine readability (i.e., a data format that can be automatically read and processed by a computer). Examples of formats that describe data files include JSON, XML, and RDF (and, for references, RIS, BibTeX, and EndNote XML). For our example data sets, we followed the Dublin Core metadata standard when describing the data sets and transformed the data into comma-separated values format (“2023-09-22 Data Set 1, Curated.csv” and “2022-12-08 Data Set 2, Curated.csv”).
Related are formats that describe the contents of the data itself. Examples are BIDS (https://bids.neuroimaging.io/) from the field of neuroimaging and Neurodata Without Borders (https://www.nwb.org/) from the field of neurophysiology. At the time of writing, psychology does not have standardized formats for data description.
Principle I2: (meta)data use vocabularies that follow FAIR principles
This principle has to do with the presence of widely recognized controlled vocabularies, ontologies, or thesauri with globally unique and enduring identifiers that a computer can recognize. Jacobsen et al. (2020) used the example of the label “temperature”: Does it refer to body temperature or melting temperature? Without knowing, a machine would not be able to find agreements or disagreements between data sets.
Taking “work-life balance” (see Data Set 1) from the European Language Social Science Thesaurus (ELSST; https://thesauri.cessda.eu/elsst-3/en/) as an example, the vocabulary includes the following identifier: urn:ddi:int.cessda.elsst:0722ed90-dfb8-46e0-8c96-37f1cc3d7337:3. This identifier allows a computer to disambiguate “work-life balance” from, for instance, “happiness” (which has identifier urn:ddi:int.cessda.elsst:c969ed50-6501-4cc6-b3b3-4a4207ef0f6c:3).
Principle I3: (meta)data include qualified references to other (meta)data
The idea is to ensure that (meta)data sets contain relevant and standardized references to one another. Continuing with the example of “work-life balance” in ELSST, the entry lists as related concepts “flexible working time,” “hours of work,” “job sharing,” “shift work,” and “Sunday working.” The Appendix includes the JSON segment that machines can use to identify these relationships. The OSF platform has a similar functionality for relating (meta)data “under the hood.”
In the context of a batch of individual studies collected in a longitudinal context, it is important to ensure that labels are applied similarly across data sets and that metadata files refer to the relevant data sets appropriately.
Note that in the context of this tutorial article, we have made two data sets available for educational purposes (i.e., in the context of providing a guide on how one might go about de-identifying one’s data while adhering to FAIR principles). The data sets are also part of their own empirical research cycles, which are published separately (i.e., https://doi.org/10.17605/osf.io/hvyzu and https://doi.org/10.17605/osf.io/9vtbz). In that context, the function of the data sets is different, and additional material that is not relevant to the present article, such as stimulus material or analysis scripts, might be appropriate. These empirical articles have separate OSF pages associated with them, and it is important that the OSF pages cross-reference each other when appropriate.
Reusable
Principle R1: (meta)data are richly described with a plurality of accurate and relevant attributes
In our opinion, when the first three letters of the FAIR acronym are addressed, the reusable part follows somewhat naturally. To enable decision-making for those who want to reuse the data (machine or human), researchers should provide both machine-readable metadata and human-readable data documentation, not only for reasons of findability but also to inform potential new users about the context in which the data were generated. We reiterate the recommendation of Jacobsen et al. (2020) to be as generous as possible when describing data: The reader is almost assuredly less knowledgeable about the data than the person who collected it.
For the current article, this means that we followed the machine-readable metadata schema Dublin Core (for more information, see “Principle F2”). In addition, both data sets were accompanied by human-readable data documentation. More specifically, we included a codebook explaining what each variable represents and measures, the de-identification protocol and the choices that were made to de-identify the data, the combined research information for participants and informed-consent form, and an R script that transforms the .sav files to .csv files.
Subprinciple R1.1: (meta)data are released with a clear and accessible data usage license
Legally speaking, public data must be accompanied by a data-usage license to be open. We have used and recommend using the CC0 1.0 Universal license, which places the data in the public domain (CC BY 4.0 technically requires that attribution be given to the sources that created the data set). Note that OSF facilitates setting a data-usage license.
Subprinciple R1.2: (meta)data are associated with detailed provenance
This principle pertains to a description of the data, including how, why, and by whom they were generated (including whether the data were provided by a third party), to help other researchers assess the reusability of the data for their own research purposes. Part of this provenance, such as information about contributors, the abstract, and the date of creation, is already included in the machine-readable metadata (for the full information that is included, see the JSON file harvested by DataCite: https://api.datacite.org/application/vnd.datacite.datacite+json/10.17605/osf.io/eqbd3). In this project, it was also important to indicate how data were de-identified (because de-identification potentially reduces the available information in a data set) so that other researchers can evaluate whether the data are useful to their own research questions. Therefore, we included human-readable documentation on the de-identification procedure and information about how the study was conducted (i.e., a wiki with a general description of the original research setup, the codebook, the combined research information for participants, and a blank informed-consent form). For more detailed information on the original research setup and the analyses conducted for the individual research articles, we refer to the separate OSF pages containing the (meta)data and data documentation connected to these research articles.
Subprinciple R1.3: (meta)data meet domain-relevant community standards
This principle describes the need for minimal information community standards, or vocabularies, to assess data quality and to allow for replicating the reported findings. An example of such vocabularies in psychology is the fifth edition of the Diagnostic and Statistical Manual of Mental Disorders (https://www.psychiatry.org/dsm5). Other examples are the ELSST (https://thesauri.cessda.eu/elsst-3/en/), the Thesaurus for Social Sciences (https://concepts.sagepub.com/vocabularies/social-science/en/), and the Thesaurus of Psychological Index Terms (https://www.apa.org/pubs/databases/training/thesaurus).
Regardless of whether relevant community standards exist, the published data set should be accompanied by extensively and clearly annotated reproducible code (e.g., R scripts; see, e.g., https://osf.io/qegy4). In the example data sets, we chose to retain, as the bare minimum, variables that are used (a) in the main (and secondary) analyses, (b) in the manipulation check, and (c) as criteria for inclusion in the study. For both example data sets, we discuss key variables for each of these in section “Step 1C: Distinguish Between Essential and Optional Variables.” In addition, we retained variables that we deemed safe in terms of risk of reidentification and that may be of interest to other researchers for follow-up analyses.
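A skeleton of such an annotated script might look as follows (illustrative only; the actual analysis scripts are available on the OSF pages):

# Reproduce the main analysis of Data Set 1 from the curated, de-identified file.
# Input: 2023-09-22 Data Set 1, Curated.csv (see https://osf.io/eqbd3/)
dat <- read.csv("2023-09-22 Data Set 1, Curated.csv")

set.seed(2024)  # fix randomness so results reproduce exactly

# ... main and secondary analyses go here, one clearly labeled block each ...

sessionInfo()   # log R and package versions next to the results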
Discussion
Open data have clear merits for science: They enable researchers to detect mistakes and to answer different research questions without collecting new data, thereby increasing efficiency and creating opportunities for academic progress. These advantages must be balanced against the potential risks that openness poses to the privacy of participants. In the European Union, researchers need to comply with the GDPR, which can be a delicate exercise for those unfamiliar with the process. More generally, however, researchers from non-EU countries also need to strike a balance between scientific interests (as open as possible) and the privacy interests of participants (as closed as necessary).
In this article, we introduced the five-safes framework (Desai et al., 2016) and the pseudonymization/anonymization matrix to guide researchers in making decisions about the use of potentially sensitive or confidential data, and we applied them through a stepwise approach (based on the anonymization plan from Radboud University Nijmegen by van der Burgt et al., 2024, which is, in turn, based on a template from the Finnish Social Science Data Archive) in two worked examples involving concrete choices on real data sets. In the first example, we used a data set from organizational psychology that was collected to study age differences in boundary management during telework. In the second example, we used a data set from experimental psychopathology coming from an analogue study employing a trauma-film paradigm. We systematically worked through both data sets, making both FAIR while limiting the risk that participants could be reidentified from the published data. Throughout the process, we collaborated with our local data steward to design appropriate de-identification techniques, implement the FAIR principles, and assess whether de-identification was successful.
In Part 1 of this tutorial article, we presented the step-by-step de-identification process that we applied to the two worked examples. We learned several important lessons when preparing these examples. First, as a general rule, variables that contain time stamps or person IDs should never be made publicly available. They have no utility for other researchers, and even in the case of online data collection, they may suffice to identify participants: Anyone who saw a participant leave the lab, or saw someone complete the study online, could use the time stamp alone to retrospectively identify that participant. This does not change the fact that it will often be useful to describe the approximate period during which an entire data set was collected. For instance, when investigating the effects of working from home, it is relevant for the reader to know that data were collected during the corona pandemic, when many people were required to work from home. In addition, there may be cases in which person IDs are needed to link data sets collected in multiple waves of a longitudinal study. In such cases, we recommend restricting access to the data or replacing the person ID assigned in the context of the original study with a new linking ID in the online repository so that research assistants involved in collecting one survey cannot use published data from a second survey to identify people.
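As a minimal sketch of the latter recommendation, assuming two survey data frames wave1 and wave2 that share an original person ID column id (all names here are hypothetical), the original IDs could be replaced with random linking IDs before publication:

# Replace original person IDs with randomly generated linking IDs so the
# published files can still be linked to each other but not back to the
# IDs used during data collection. All names are hypothetical.
ids <- union(wave1$id, wave2$id)
linking_id <- sample(seq_along(ids))  # random permutation of new IDs
wave1$id <- linking_id[match(wave1$id, ids)]
wave2$id <- linking_id[match(wave2$id, ids)]
# Do NOT publish the mapping between 'ids' and 'linking_id'; discard it
# or store it under restricted access.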
Second, some variables may require different treatment in different data sets. When participants come from a local student population, as was the case in our second worked example, variables such as age, nationality, or native language may act as unique identifiers. If, on the other hand, participants were recruited online with the only restriction being that their native language is English, as in our first worked example, the same variables pose far less risk of identification. Similar concerns apply to a variable such as gender: Have minority response categories been selected that would allow identification of participants? In our first data set, everyone identified as female or male, mitigating the risk of identification, but this may obviously not hold for other data sets. Thus, the context in which variables were collected is crucial in determining whether to anonymize or pseudonymize them.
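One way to make such context-dependent judgments concrete is to count how many participants share each combination of potential quasi-identifiers; combinations that occur only once single out a participant. A minimal sketch in base R, assuming a data frame dat with the (hypothetical) column names shown:

# Count how many participants share each combination of potential
# quasi-identifiers. Column names are hypothetical; adapt as needed.
quasi <- c("age", "gender", "nationality")
counts <- aggregate(rep(1, nrow(dat)), by = dat[quasi], FUN = sum)
names(counts)[ncol(counts)] <- "k"
# Combinations with k == 1 identify a unique participant; consider
# coarsening (e.g., age bands) or suppression before publication.
counts[counts$k <= 2, ]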
Third, variables that contain open-text answers should always be scrutinized for identifying information before they are made publicly available. Evaluating text answers to decide whether they can be included in the published data set can be a tedious and complicated task. A first step is to have at least two researchers independently check whether respondents included identifiable locations or names. If so, these identifiers should be deleted from the data set that is prepared for third parties. A second step is to have those researchers check whether respondents reveal recognizable personal events. If so, those events should be replaced either with a different but related event (e.g., "house was flooded" can become "power outage"; "basketball training" can become "soccer training") or with a more general and therefore less traceable event (e.g., "house was flooded" can become "accident at home"; "basketball training" can become "leisure activity"). If such a variable is not directly relevant (e.g., when participants are given the opportunity to share their thoughts or feelings about the study), the easiest solution is simply not to publish it at all. If it is an important variable, however, carefully going through all fields is unavoidable, and some subjectivity in judging which answers can and which cannot be published is inevitable.
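A simple automated pass can support, although never replace, the human checks described above. As an illustrative sketch under the assumption of a hypothetical open-text column dat$open_text, answers could be flagged for manual review whenever they match a reviewer-curated list of names and places:

# Flag open-text answers that mention entries from a curated list of
# names and places so they can be reviewed by hand. All names and the
# column 'open_text' are hypothetical.
identifiers <- c("Nijmegen", "Anna", "Central Station")
pattern <- paste(identifiers, collapse = "|")
flagged <- grepl(pattern, dat$open_text, ignore.case = TRUE)
dat$open_text[flagged]  # answers that still need manual de-identification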
In some cases, it might be impossible to completely anonymize a data set. A solution is then to replace the original raw data with synthetic data (e.g., using the synthpop R package, available from https://cran.r-project.org/web/packages/synthpop/index.html). Such an approach has the advantage of allowing researchers to publish variables for which de-identification is difficult or impossible by replacing participants' individual entries with simulated values (analogous to our approach for de-identifying the full, preprocessed data). Depending on the needs, it is possible to uncouple data entries from specific participants while retaining selected overall properties. For instance, in the case of a simple two-group comparison, the group means can remain the same while every individual data value is different. A downside of this approach is that future analyses involving relationships that were not preserved in the synthesis will no longer be possible. For instance, in the previous example, a follow-up analysis that takes age as a covariate will not yield sensible results because the simulated values were created with respect to group membership only, not with respect to the age variable. Another disadvantage is that this approach does not work well for repeated measures designs, such as intensive longitudinal studies, because the dependency of data entries across measurement points needs to remain intact.
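As a minimal sketch of this approach with synthpop, assuming a data frame dat that contains only the variables selected for release:

# Generate a synthetic version of 'dat' (a hypothetical data frame
# holding only the variables selected for release).
library(synthpop)
sds <- syn(dat, seed = 2024)  # model the data and simulate new values
synthetic <- sds$syn          # synthetic data set that could be published
compare(sds, dat)             # inspect how well distributions are preserved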
A general lesson we would like to stress, which has not yet come up explicitly in this article, is that researchers who want to open up their data should start as early in the research cycle as possible, for instance, by thinking about the text of the informed consent. When collecting or processing personal data at any point during a research project, the GDPR requires that participants be informed about how their data will be processed and whether the data will be shared with others. Typically, researchers in the behavioral and social sciences inform participants through an information form and ask for consent to process personal data for the purpose of their research, including the reuse of these data for future research projects by themselves or the larger research community. If consent was used as the legal basis for processing the data, it is important either to define under what conditions participants consent to the reuse of their personal data or to anonymize the data to such an extent that they cannot be traced back to individual data subjects under any circumstances before making them publicly available. Even though it might not be necessary from a legal perspective, from an ethical perspective, one could argue that participants should be informed about the sharing and reuse even of anonymized data: Participants invested time and effort to take part in a specific study and might not agree with their data being reused for other purposes.
When writing this article, we were painfully reminded of the importance of thinking about the phrasing of the informed consent well in advance of a study. For our first worked example, we started by processing a similar data set from the same researchers, and only after a while did we find out that the consent form did not ask participants for permission to share their personal data for reuse by the scientific community. Although this cost us some time, we found a different data set for which participants did explicitly give permission for data sharing among researchers. To prevent such errors, we highly recommend including a blank consent form along with the data files because, at the very least, this serves as an extra reminder to verify that sharing is appropriate. A common issue is a lack of specificity in the consent given by participants: Details about who has or will have access to the data are not always included. The GDPR explicitly states that the study description on the basis of which the participant gives consent must be specific and informative.
Besides the legal ground for collecting, analyzing, and reusing personal data in a research project, it is also important to consider another GDPR principle at the beginning of a research project: data minimization. According to GDPR Article 5, when dealing with personal data, researchers should collect and use only the data necessary for the study purpose. For instance, survey-collection software typically records dates, IP addresses, and latitude and longitude coordinates. If these variables do not serve any purpose for the research, researchers should try to prevent collecting them or, if this is not possible, discard them as soon as possible after data collection. Although it fell outside the scope of the current article to provide guidance on how to implement the principle of data minimization, designing a research project with this principle in mind supports a more efficient road to making research data collected from human participants FAIR.
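As a minimal sketch of data minimization in practice, such technical metadata can be dropped immediately after export; the file and column names below are illustrative and depend on the survey platform used:

# Drop technical metadata that serves no research purpose immediately
# after export. File and column names are illustrative.
raw <- read.csv("survey_export.csv")
drop_cols <- c("IPAddress", "LocationLatitude", "LocationLongitude",
               "StartDate", "EndDate")
minimal <- raw[, setdiff(names(raw), drop_cols)]
write.csv(minimal, "survey_minimized.csv", row.names = FALSE)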
The two examples discussed in this article are only two of many potential data sets we could have used. Every data set has its idiosyncratic issues and decisions that need to be made, but we believe that the issues that emerged from this exercise apply to many other data sets within and outside the social sciences. Anonymizing or pseudonymizing raw data will not always be possible, for instance, when the raw data include video material of participants. Video-editing techniques exist for blurring faces and distorting voices, but they may not always be sufficient to fully anonymize the data. In such cases, we recommend that researchers make the processed data available, for instance, the coding of the behavior in the video, along with the codebook that explains how the coding was done. At the very least, this flags that these data exist and enables others to contact the authors who hold them.
In Part 2 of this tutorial article, we described how we made the anonymized data publicly available while adhering to the FAIR principles. Here, we had two important insights. First, we realized that openness refers not just to the availability of the data themselves but also to the software in which the data were processed. The data files for both worked examples were available in SPSS format, but one might question whether it is inclusive to use commercial packages (which may be unavailable to many researchers) to present data. Researchers should be aware that publicly sharing data files that were collected using proprietary software may in practice mean the data are open only to a limited and arguably privileged subgroup of researchers. Fortunately, several options are available for converting such data files to an openly accessible format, such as .csv. When choosing such an option, it is key to confirm after conversion that all relevant information is retained (e.g., variable labels, potential references between multiple tabs).
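As a minimal sketch of such a conversion, the haven R package can read SPSS files, after which the data and the variable labels can be written to open .csv files; the file names here are hypothetical:

# Convert an SPSS file to open formats. The .csv itself does not retain
# variable labels, so export them separately as a codebook. File names
# are hypothetical.
library(haven)
dat <- read_sav("data.sav")  # variable labels are kept as attributes
write.csv(dat, "data.csv", row.names = FALSE)
labels <- vapply(dat, function(x) {
  lab <- attr(x, "label")
  if (is.null(lab)) "" else lab
}, character(1))
write.csv(data.frame(variable = names(dat), label = labels),
          "codebook.csv", row.names = FALSE)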
Second, the language in which the original data were collected is not always accessible to other researchers. We mentioned that we had to abandon a different data set as our first example because the informed consent restricted sharing the data. In this unused data set, participants were all German native speakers. Although we were sufficiently fluent to understand the variable labels and open-text answers, this will likely not hold for the academic community at large. This presents a dilemma: On the one hand, translating the data set into English would be more inclusive and would make the data file more accessible, at least in an intellectual sense. On the other hand, is it reasonable to require researchers who have collected data in a language other than English to translate their data, stimulus material, and other materials into English? We recommend translating the codebook (including variable labels) into English because failing to do so hampers understanding of how the reported research questions were operationalized. However, we do not think it should be necessary to translate all participant statements in a data set that is made available online. Regardless, whenever anything is translated, the original untranslated version should be made available as well.
Recently, researchers and data-support staff have been working collaboratively on the implementation of FAIR, resulting in best practices in several research fields. While writing this article, we came across the initiative of the GO FAIR Foundation to develop FAIR Implementation Profiles (FIPs) together with the community to provide community standards for implementing the FAIR principles in and across research domains (Schultes et al., 2020). For the current data sets, the FIP developed for social-science survey research (https://w3id.org/np/RA2C1h_SkOgPiylavYM6bs_wSW6SgzrvC5kiVdJvmWq9s) provided extra guidance for the choices we made in this tutorial article. We recommend that researchers look into the FIPs for their own research fields before developing their own practices. Although only a few FIPs have been developed so far and these best practices are not yet widely known among researchers, they can save a lot of time and effort when one is developing a FAIRification pipeline.
Finally, several tools help researchers check how FAIR a data set is, including the CSIRO 5-star data-rating tool (Yu & Cox, 2017, Version 5), the DANS FAIR-Aware tool (https://fairaware.dans.knaw.nl/), and the Australian Research Data Commons's FAIR data self-assessment tool (https://ardc.edu.au/resources/aboutdata/fair-data/fair-self-assessment-tool/). These tools allow researchers to check how findable, accessible, interoperable, and reusable a data set is and give insight into how its FAIRness can be enhanced. For findability, for example, the Australian Research Data Commons's FAIR data self-assessment tool asks the following four questions: (a) Does the data set have any identifiers assigned? (b) Is the data set identifier included in all metadata records/files describing the data? (c) How are the data described with metadata? (d) What type of repository or registry is the metadata record in?
An important concluding observation is that our decision process in preparing both data sets for online sharing was subjective, and readers may not agree with every decision we made. We do not, however, believe that a "gold standard" for data sharing is possible. Rather, we view the considerations that feed into deciding how far to de-identify and how to FAIRify a data set as the important take-home message of this tutorial article. We hope this work provides inspiration on how to approach the delicate process of opening up data while respecting the privacy of the participants who provided them.
Concluding Remarks
Sharing data has clear benefits for the academic community and for society at large. We hope the reader shares our conviction that deciding what to share and what to de-identify requires careful consideration: As is often the case in science, there is no mindless ritual to follow. We hope the worked examples in this article provide researchers with some inspiration on how to balance openness and privacy when sharing their own data sets.
Appendix
In the ELSST, the concept "work-life balance" lists as related concepts "flexible working time," "hours of work," "job sharing," "shift work," and "Sunday working." To allow machines to identify these relationships, the metadata of the work-life balance entry includes a "related" array in JSON with one object per related concept (the contents of the objects, i.e., the identifiers and labels of the related concepts, are omitted here):
"related": [
    { … },
    { … },
    { … },
    { … },
    { … },
    { … }
]
Acknowledgments
We thank Katie Corker, Malte Elson, Leon ter Schure, Diana van Bergen, and Christina Elsenga for their input on previous versions of this article.
Transparency
Action Editor: Rogier Kievit
Editor: David A. Sbarra
Author Contributions
