Abstract
Health organisations use numerous different mechanisms to collect biomedical data, to determine the applicable ethical, legal and institutional conditions of use, and to reutilise the data in accordance with the relevant rules. These methods and mechanisms differ from one organisation to another, and involve considerable specialised human labour, including record-keeping functions and decision-making committees. In reutilising data at scale, however, organisations struggle to meet demands for data interoperability and for rapid inter-organisational data exchange, owing to their reliance on legacy paper-based records and on the human-initiated administration of the permissions accompanying data. The adoption of permissions-recording and permissions-administration tools that can be implemented at scale across numerous organisations is imperative. Further, these must be implemented in a manner that does not compromise the nuanced and contextual adjudicative processes of research ethics committees, data access committees, and biomedical research organisations. The tools required to implement a streamlined system of biomedical data exchange have in great part been developed. Indeed, there remains but a small core of functions that must further be standardised and automated to enable the recording and administration of permissions in biomedical research data with minimal human effort. Recording ethical provenance in this manner would enable biomedical data exchange to be performed at scale, in full respect of the ethical, legal, and institutional rules applicable to different datasets. This holds despite foundational differences between the distinct legal and normative frameworks applicable to the communities and organisations that share data with one another.
Introduction
The large-scale processing of health and biomedical data has been hailed as a paradigm shift in the practice of biomedical research. The storage and processing of data benefits from being scalable, and therefore much more cost-effective than traditional bench science.1 The administrative and societal infrastructures that support large-scale biomedical research, however, seldom benefit from the same potential for scale. Critical challenges in unlocking the benefits of large-scale biomedical research include ensuring the technical compatibility of research datasets from distinct research sites, and coordinating the storage and sharing of data to enable its future downstream use. These challenges arise in large part from the non-technical elements of such efforts. Examples thereof include building consensus on shared technical standards, ensuring access to the human and financial resources required to sustain data exchange platforms for long periods of time, and aligning data uses to prevalent social mores and to the demands of regulators.
The potential to generate, process, and share biomedical data at scale creates considerable societal value. This has been recognised in a wide array of literature, which describes the contribution of existing data to biomedical research, clinical care, and public health surveillance. However, scalable data use can be difficult to achieve in full respect of the applicable ethical, legal, and institutional requirements. Performing onerous compliance activities, in reliance on cost-intensive human labour, can frustrate the potential to share and analyse data at scale. Nevertheless, it is equally unpalatable to advocate for the repeal of regulations, or for non-compliance, in order to achieve more biomedical data processing.
This article proposes a series of infrastructures to address this conundrum. Organisations engaged in the bilateral transfer of data might find the promise of scalable data analysis hindered by the cost-intensive ‘human elements’ that underpin information exchange, including norm compliance. Large-scale groups of collaborating organisations can alleviate these burdens using a combination of organisational practices, policy documents, and technological systems, to record, track and assess the ethical and legal restrictions applicable to datasets in a comparable manner. This system is referred to, as a whole, as an ‘ethical provenance record’ – a combination of technological and organisational structures that enables groups of researchers to compare and monitor the national biomedical ethics requirements, legal commitments and institutional conditions of use applicable to their respective datasets, despite these rules being expressed in the incommensurable language of distinct normative frameworks.
Part 1: Ethical provenance records
Enabling the re-utilisation of biomedical and health-related datasets that are generated through research funding efforts is critical to maximising the downstream benefit that research involving human participants generates. The challenges confronted are numerous. These include obtaining the financial, human and computational storage resources needed to generate data and to retain it for long periods of time. Others include ensuring that the data which numerous distinct contributors generate can be compared in a meaningful manner, through both ex-post and ex-ante efforts to format the data in an interoperable or harmonised fashion (i.e., to enable the meaningful comparison and combination of data across different research efforts or research sites). The legal and ethical rules applicable to data use must also be assessed to determine the boundaries of appropriate data use.
Efforts to overcome these challenges have succeeded through the piecemeal collaboration of numerous stakeholders in the research ecosystem, rather than through a central, concerted solution. Research funding agencies use funding policies and data sharing requirements to incentivise the open dissemination of research data. Centralised and decentralised data repositories pool resources from multiple organisations to enable the storage of data, and to perform the cost-effective administration of downstream access thereto. Open science communities, research consortia, and representative bodies of researchers develop policies and tools to enable the formatting of data in compatible forms. Some universities further stimulate the open dissemination of data and other scientific research outputs by requiring researchers to share those products as a precondition to career advancement. Together, these distinct organisational practices ensure that repositories containing large quantities of biomedical data emerge, and that the data these repositories host is suitable for combined use across multiple downstream research contexts (Devriendt et al., 2022).
One of the major remaining challenges that affects the potential to generate, centralise, and repurpose data for compatible downstream uses is ensuring that proposed data sharing remains compliant with applicable norms. These are heterogeneous in nature, and continue to proliferate and to change across time. Performing regulatory compliance efforts is cost-intensive, and becomes increasingly so as additional stakeholders and jurisdictions are implicated in a proposed data sharing effort.
Resource-intensive legal compliance efforts exist in difficult tension with the inherent promise of scalable information exchange and information processing. Legal harmonisation and legal simplification alone cannot resolve this dilemma. Some distinctions in the data use conditions applicable to different datasets arise from the choices of researchers or research organisations, in determining the contracts and policies applicable to data. These could, presumably, be harmonised. Nonetheless, much of the heterogeneity in the norms applicable to data use reflects true policy disagreements between different legislatures or organisations as to the appropriate limits of information use, which cannot be resolved through harmonisation. Heterogeneous rules of information use, applicable to research data, will remain. Our preoccupation, therefore, is whether it is possible to meet the demand for the cost-effective, scalable use of biomedical and health information, despite the continuous evolution of contrasting and context-sensitive norms, each applicable in distinct circumstances of information use.
The remainder of this paper describes a holistic policy approach to the collection and recording of ethical and legal provenance metadata, and to its harmonised comparison across distinct health and biomedical data repositories by using common algorithms that implement shared decision-making heuristics at scale.
Recording ‘ethical provenance’ refers to the practice of creating a record of the ethical, legal and institutional rules of use applicable to specified datasets. This also includes updating such records as the permissions applicable to the concerned datasets change, and ensuring that the concerned datasets are labelled with standard-form descriptions. It is argued that this would achieve significant public policy goals.
Distinct studies, cohorts comprising multiple studies, and repositories containing the data from multiple cohorts often subject their data to stewardship rules that are not compatible. In addition, prospective downstream users bear high costs in finding datasets that are suitable for their intended data uses. For example, the euCanSHare effort catalogued the data stewardship practices of 27 distinct population health cohorts in Canada and Europe. This evaluation illustrated that it was cost-intensive to retrospectively assess the data stewardship rules applicable to multiple distinct population health cohorts. High costs (time, labour and expertise) arose in evaluating and confirming the data stewardship rules applicable to each dataset. High costs also arose in comparing the data stewardship rules applicable to multiple datasets to determine whether these were compatible. This effort also demonstrated that the researchers producing multiple population health cohorts, acting independently, would most often develop incompatible data stewardship rules (Bernier and Knoppers, 2021).
These results provide empirical evidence for the proposition that organisations must agree on shared methodologies in developing data stewardship rules for their respective data holdings. Otherwise, the transaction costs and administrative burdens inherent in accessing data will preclude data reuse. The remainder of this proposal describes infrastructures (organisational practices, technological systems, and administrative support) that enable communities of upstream biomedical data contributors and downstream biomedical data users to better communicate regarding the conditions according to which available biomedical data can be reused.
Part 2: Producing and respecting ethical provenance records
Key processes that enable the maintenance of a clear chain of ethical provenance and the automation of data access oversight include (1) informed consent processes, (2) the description of data use permissions, (3) the elaboration of data access agreements, and (4) the design of data access committees. Automation, in this context, refers not only to the computational/technological automation of data governance efforts, but also to the creation of tools that enable humans to perform data access oversight in a more cost-effective and less burdensome manner.
Although there remain gaps, the architecture of a system devised to automate the comparison of the ethical and legal conditions of data use has been devised in great part. Organisations dedicated to the international sharing of biomedical data, and the development of infrastructure for biomedical data analysis, have already begun to create and adopt template documents, self-assessment forms, software tools and administrative bodies to facilitate the creation of compatible data stewardship rules across different biomedical data repositories. By exploring the initiatives below, we hope to show how tools can be used to create a holistic chain of ethical provenance and to facilitate or automate the comparison of the rules of use applicable to independently generated datasets.
Two strategies have been developed to respond to existing limitations in human-initiated and organisation-specific development, recording, and comparison of applicable data governance conditions. The first, standardisation, enhances the compatibility of datasets that are not generated for the express purpose of compatible use, or that are subject to the stewardship of different research organisations. The second, automation, reduces the costs inherent in: enabling parties to contribute data to a research data repository; obtaining access to data hosted on such a repository; and performing the oversight of data access requests. Full or partial automation is often achieved in delegating responsibilities otherwise left to human judgment to an algorithm or to a standard-form set of instructions.
Standardisation efforts in the area of biomedical data governance have been initiated by numerous organisations, including the Global Alliance for Genomics and Health (GA4GH). Outputs include informed consent form (ICF) templates, best practices for the creation of informed consent materials, core consent elements, template contracts, and self-assessment tools directed to specific scientific communities rather than to specific research data repositories (Global Alliance for Genomics and Health, n.d.). In adopting these shared templates and tools, scientific communities can lower the barriers and costs associated with sharing data amongst themselves. This approach facilitates data stewardship by attempting to align the rules applicable to the use of each concerned dataset as much as possible, through the use of informed consent to research participation, contracts and other similar mechanisms.
Consent
Ensuring consent to data acquisition, data use, and data sharing is often a precondition to the contribution of data to a research repository for its reuse, or its ad-hoc sharing for the purpose of future utilisation (Kaye and Prictor, 2021). It is usually a requirement of both local law and of applicable national biomedical research ethics guidelines. Consent empowers participants to determine the future treatment of their biomedical data once they have decided to authorise its use for research purposes. Maintaining accurate and reliable documentation of research participant consent is therefore a core prerequisite to developing records of ethical provenance.
The study-specific or cohort-specific drafting of informed consent materials often imposes unintentional or case-specific limitations on the future use of information. Because the development of these specialised documents carries high administrative and human costs, drafting is not always performed with close attention to maximising the future potential to utilise the data, and unintentional limitations on future research uses of data can result.
Furthermore, informed consent materials may be in the form of paper records. This can cause data to be lost to future re-use if the informed consent materials related to a dataset are not conserved. The use of paper-based records also imposes high costs on the verification of applicable consent permissions, and can result in these verification procedures being performed inaccurately, through human error. Intended data re-use can be abandoned if the labour requirements inherent in performing human verification of informed consent materials are so high as to be unpalatable (Velmovitsky et al., 2021; Mackey et al., 2019).
These issues are likely to be exacerbated if re-consent is required upon a request to share the data with another research organisation (Velmovitsky et al., 2021). Re-consent may also be required to expand the scope of the research questions studied, to enable the collection of additional data elements, or to enable additional researchers to obtain access to the collected data. Without a suitable record of the initial consent forms used, or the conditions included therein, opportunities to perform re-consent might also be negatively impacted, resulting in possible research benefits remaining out of reach. This is the case because the binding national biomedical research ethics guidance applicable to research may require researchers to demonstrate that consent to re-contact was already acquired at the moment of initial data collection. This is often required to permit re-contact of research participants, so as to obtain another specific or more expansive consent. Finally, the inability to provide documentation of informed consent can also undermine the publication of research findings. If it cannot be demonstrated to research publications that appropriate informed consent of research participants was obtained, publishing research results can prove difficult or impossible. All of these issues will be further intensified if cost-effective mechanisms to create records of ethical provenance are not constructed (Wooley, 2016).
It should be noted that some tools that aim to standardise the methods of obtaining and documenting informed consent, and to describe the permissible uses of data, already exist. Examples include the tools and policies of the GA4GH. The tools developed include template informed consent materials that enable researchers to ensure that the same consent elements are represented in the consent materials of different research study sites or different research repositories (including across jurisdictions). Using these templates helps to ensure that distinct local research teams and research collaborations obtain informed consent to research participation according to compatible terms, even where the consent forms themselves differ (Global Alliance for Genomics and Health, n.d.). Differences in the form or content of the informed consent materials of specific research cohorts or study sites are to be anticipated, because institutional practices, local regulations and the needs of particular research populations can require distinctions to be made. Standards therefore strive to ensure heightened or total compatibility between the contents of these forms, rather than to perform their total harmonisation.
The Machine-Readable Consent Guidance of the GA4GH is one such standard. This guidance document helps to ensure that informed consent materials are designed to collect informed consent to research participation that can be represented in a machine-readable format. This is achieved by incorporating a standard set of informed consent elements to informed consent materials that align to standardised Data Use Ontology (DUO) terms, which can be compared to one another through the use of a simple algorithm (Global Alliance for Genomics and Health, 2020).
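The kind of simple comparison such machine-readable consent enables can be sketched as follows. The DUO term codes below are real, but their semantics are simplified, and the matching rule is an illustrative assumption rather than the official GA4GH algorithm:

```python
# Illustrative sketch: comparing a dataset's consent-derived data-use term
# against a requested research purpose. The DUO codes are real, but the
# matching logic here is a simplification, not the official GA4GH algorithm.

GRU = "DUO:0000042"  # general research use
HMB = "DUO:0000006"  # health/medical/biomedical research
DS = "DUO:0000007"   # disease-specific research (paired with a disease code)

def request_permitted(dataset_term, dataset_disease, request):
    """Return True if a data access request is compatible with the
    dataset's data-use term.

    `request` is a dict such as:
        {"purpose": "health" or "general", "disease": "MONDO:..." or None}
    """
    if dataset_term == GRU:
        return True  # general research use permits any research purpose
    if dataset_term == HMB:
        return request["purpose"] == "health"
    if dataset_term == DS:
        # disease-specific consent: the request must target the same disease
        return request.get("disease") == dataset_disease
    return False  # unknown term: fail closed and refer to a human reviewer

# An HMB-consented dataset permits a health research request...
assert request_permitted(HMB, None, {"purpose": "health", "disease": None})
# ...but not a general, non-health one.
assert not request_permitted(HMB, None, {"purpose": "general", "disease": None})
```

Note the fail-closed default: a term the algorithm does not recognise is escalated to human review, preserving the adjudicative role of data access committees.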
We maintain that in the future, biomedical research organisations should implement the following practices. First, organisational personnel should be trained to maintain records of ethical provenance for datasets that are generated for research purposes. This means that the informed consent materials should be retained in electronic form, as should clear descriptions of the rules of use applicable to each such dataset. Second, internal policies and practices should be developed to explain what information need be included in these records of ethical provenance, and the format to be used in documenting it. Personnel should receive training to ensure that such is done consistently within a singular organisation. Third, common procedures and practices for developing records of ethical provenance, and shared technological systems enabling maintenance, should be implemented across distinct research organisations (Toga and Dinov, 2015).
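As an illustration of the electronic record these practices imply, a minimal ethical provenance entry might capture the retained consent documentation, the applicable data-use terms, and a history of permission changes. The field names and structure below are hypothetical, not a proposed standard:

```python
# Hypothetical sketch of an electronic record of ethical provenance for
# one dataset. Field names are illustrative, not a proposed standard.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ProvenanceEvent:
    """One change to a dataset's permissions (e.g. a re-consent)."""
    occurred_on: date
    description: str

@dataclass
class EthicalProvenanceRecord:
    dataset_id: str
    consent_form_uri: str          # location of the retained consent materials
    data_use_terms: list           # standard-form ontology terms, e.g. DUO codes
    institutional_conditions: str  # summary of local rules of use
    history: list = field(default_factory=list)

    def amend(self, when, description, new_terms):
        """Update the permissions and log the change, keeping the chain intact."""
        self.history.append(ProvenanceEvent(when, description))
        self.data_use_terms = new_terms

record = EthicalProvenanceRecord(
    dataset_id="cohort-01",
    consent_form_uri="file:///archives/consent_v1.pdf",
    data_use_terms=["DUO:0000042"],
    institutional_conditions="Internal data stewardship policy v1",
)
record.amend(date(2024, 1, 15), "Re-consent narrowed use to health research",
             ["DUO:0000006"])
```

The append-only history list reflects the requirement that records be updated, rather than overwritten, as permissions change.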
Data permissions
Research data repositories often describe the minimum ethical and legal permissions that must apply to data prior to its deposit in a concerned repository. These elements help researchers both to determine whether research data is suitable for contribution to a repository, and to assess whether the applicable informed consent materials and other applicable data use rules are compatible with the anticipated data stewardship practices of the recipient repository. To this end, research data repositories use these minimum ethical and legal permissions, and associated self-assessment tools, to develop informed consent material templates compatible with the requirements of repositories, prior to research data being produced. This helps researchers ensure that different datasets are subject to common consent conditions (and other institutional or contractual conditions of data use), despite having been developed in regions or research organisations subject to distinct ethical, legal, and organisational requirements. This is a crucial precondition to creating a large-scale research cohort, or a meta-cohort composed of data from multiple research cohorts, that is nonetheless permissioned to be used for a common set of downstream research purposes. This approach is often used to prospectively generate research data according to common conditions of downstream use.
Other strategies are used to retrospectively verify whether pre-existing data, known as ‘legacy’ data, can be deposited to a repository. Research repositories use comparison tools, termed retrospective consent filters, to verify whether their ‘legacy’ datasets are subject to conditions of data stewardship that enable them to be contributed to the concerned repository. If such compatibility is not evident, the retrospective filters provide guidance to researchers in determining how additional permissions might be secured, to enable deposit in the concerned research data repository. This tool helps to ensure that all research groups considering data contribution use similar criteria to evaluate whether their data can be contributed to a specified research data repository on the basis of the applicable informed consent, and the other applicable governance rules (Wallace, Kirby, and Knoppers, 2020).
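A retrospective consent filter of this kind can be thought of as a checklist run against a legacy dataset's documented permissions. The criteria below are illustrative assumptions, not those of any actual repository:

```python
# Illustrative retrospective consent filter: checks a legacy dataset's
# documented permissions against a repository's minimum deposit
# requirements. The criteria are hypothetical examples.

REPOSITORY_REQUIREMENTS = {
    "consent_documented",      # informed consent materials are retained
    "secondary_use_permitted", # consent covers reuse beyond the original study
}

def filter_legacy_dataset(documented_permissions):
    """Return (eligible, missing): whether the dataset can be deposited,
    and which requirements would need further permissions (e.g. re-consent)."""
    missing = sorted(REPOSITORY_REQUIREMENTS - set(documented_permissions))
    return (not missing, missing)

eligible, missing = filter_legacy_dataset({"consent_documented"})
# Not eligible: the filter's guidance points to the permission still to be
# secured before deposit.
assert not eligible and missing == ["secondary_use_permitted"]
```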
Other standardisation efforts include the use of standard-form ‘ontologies’ to ascribe to datasets terms reflecting the specific governance conditions applicable thereto. Examples include Consent Codes (Dyke et al., 2016), and the Data Use Ontology (DUO) (Lawson et al., 2021). These tools enable organisations to adopt a common language to confirm the presence of certain minimum ethical and legal permissions applicable to their respective datasets, despite the differing ethical and legal conditions applicable to the use of each.2 This facilitates the human-initiated or algorithmic comparison of permissions inherent in data for the purpose of determining whether distinct datasets can be used and/or shared for common research purposes (Cabili et al., 2021).
These initiatives facilitate the automation of biomedical data stewardship. In adopting standardised representations of the principal ethical and legal rules applicable to the use of distinct datasets, it becomes easier to automate the review of data access requests, in whole or in part, because the number of different combinations of potentially applicable rules decreases as the conditions of data use become more streamlined (Thorogood, 2020). Indeed, it is easier to compare the contents of standard-form ontology terms applied to each dataset than it is to perform the human-initiated comparison of the underlying ethical and legal permissions applicable to each dataset (Thorogood, 2020). In addition, having a simple record of ethical provenance expressed in meaningfully comparable ontology terms helps to guarantee to research ethics committees and research participants that datasets have been used in compliance with the applicable informed consents and other applicable rules.
Already, efforts have been made to streamline and to automate the process of reviewing data access requests directed to one or to multiple data repositories. Certain such efforts include the creation of algorithms designed to match together datasets that are subject to compatible conditions of data use. To function, these algorithms often require the conditions applicable to data use to be represented using standard-form ontologies. More granular permissions that are described in qualitative terms are not machine-readable. The larger the number of datasets that are intended to be used in combination for a shared research purpose, the more difficult it becomes to assess the compatibility of the associated conditions of data use through traditional human-initiated evaluation. Therefore, as data-enabled biomedical research continues to leverage larger pools of data of disparate provenance, it becomes increasingly important to use standard-form ontologies to represent the ethical and legal conditions of data use to enable automated comparison (Cabili et al., 2021).
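The scaling argument can be made concrete with a sketch: once each dataset carries standard-form terms, matching a request against N datasets reduces to N set comparisons rather than N human-led document reviews. The terms and matching rule below are illustrative assumptions, not an actual repository's logic:

```python
def compatible_datasets(datasets, accepted_terms):
    """Select datasets whose standard-form data-use terms all appear in the
    set of conditions the requester has agreed to honour.

    `datasets` maps dataset IDs to the set of ontology terms attached to
    each; `accepted_terms` is the set of conditions the requester accepts.
    (Illustrative rule: real matching logic, e.g. under DUO, is richer.)
    """
    return [ds for ds, terms in datasets.items() if terms <= accepted_terms]

catalogue = {
    "cohort_a": {"GRU"},          # general research use
    "cohort_b": {"HMB", "NCU"},   # health research only, no commercial use
    "cohort_c": {"HMB"},          # health research only
}

# A non-commercial health-research request can draw on all three cohorts.
assert compatible_datasets(catalogue, {"GRU", "HMB", "NCU"}) == [
    "cohort_a", "cohort_b", "cohort_c"
]
```

The per-dataset check is a constant-time set comparison, which is what allows the same oversight heuristic to be applied uniformly across pools of data of disparate provenance.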
Research organisations should use standard-form ontologies to describe the data stewardship rules applicable to their data, instead of free-text descriptions. This would require them to have an internal procedure for ascribing ontology terms to datasets, and for developing their informed consent materials and organisational data stewardship rules in a manner that is compatible with the language of the concerned ontologies. Inter-organisational alignment as to the most relevant ontologies, and as to the technological systems used to record and compare these terms would also be required. This would enable ethics review processes and data stewardship activities to be performed in a compatible manner across different organisations.
Each research organisation that generates biomedical research data would be charged with selecting standardised governance metadata ontologies to be applied to all of its datasets, and with adopting methodologies to label its datasets with ontology terms in an internally consistent manner. Inter-organisational collectives and standardisation bodies bear much of the onus for developing shared methodologies and ontologies that multiple organisations can adopt. In standardising their practices for recording ethical provenance, these organisations ensure that the permissions assigned to datasets are comparable and interoperable.
The foregoing sections have addressed the policies and practices required to subject data to compatible rules of data use at the time of data generation and thereafter, prior to its deposit from upstream organisations to research data repositories. The following sections address the development of data access agreements at the organisational, multi-organisational, or repository level, for the purpose of ensuring that the conditions of downstream use applicable to each dataset are compatible.
Data access agreements
Research data repositories often develop contracts to enable the deposit of data and the release of data to external researchers, according to compatible contractual terms. This practice guarantees that upstream researchers depositing data, research data repositories hosting data, and downstream researchers accessing data are bound to compatible commitments regarding the acceptable uses of the concerned data.
Currently, most repositories create their own inbound Data Contribution Agreements and outbound Data Access Agreements (DAAs), and most organisations create their own bilateral or multilateral Data Sharing Agreements (DSAs). These are not verified relative to a common standard, and often differ between organisations. Their creation can be time-consuming and can require input from multiple actors, and the varieties that exist across organisations can create barriers to data sharing. This is an area where standardisation and automation efforts have not so far proven successful. The lack of compatibility between the elements that different organisations integrate to their respective agreements, and the lack of common templates for their development or implementation, creates challenges in performing both standardisation and automation. This could prove a long-term impediment to the cost-effective administration of data access requests if distinct datasets remain subject to highly distinct contractual agreements. One potential solution is to design agreements at the level of the research repository, rather than at the level of the local research organisation. However, this solution does not enable distinct research repositories to easily adopt compatible procedures for standardising or streamlining data access oversight across these distinct repositories. That is, if distinct research organisations or repositories require applicants for data access to sign the specific data access agreements that each organisation has devised, efforts to render the respective data access procedures streamlined and compatible will be frustrated. To resolve these difficulties across multiple biomedical research communities, agreement on shared contractual templates or core contractual elements will be required across multiple organisations or repositories.
In the alternative, networks of research data repositories can delegate responsibilities for data access oversight to a specified organisation or repository in the wider network, or can recognise the authority of each network participant to authorise data access requests directed to the data of all network participants (Bernier et al., 2022; Schatz et al., 2022).
These challenges persist even if informed consent materials and institutional data use conditions are made compatible for distinct research datasets. The organisational rules determining the purposes for which such data can be reused would thus be harmonised, or at least compatible. However, the process of accessing and reusing such data would not necessarily be streamlined if the contractual agreements determining the conditions of use remain non-harmonised. Indeed, the proliferation of distinct agreements establishing the contractual conditions according to which such data exchanges must occur creates challenges in streamlining data access request processes (Saulnier et al., 2019).
For example, if distinct research organisations require applicants for data access to sign an organisation-specific DAA, then applicants to datasets from multiple different organisations or repositories will need to replicate efforts in reviewing, agreeing to, and ensuring compliance with each distinct data access agreement. This remains true even if the conditions of data use applicable to each underlying dataset are made compatible. Responsibilities for developing and implementing DAAs that minimise the use limitations imposed on datasets lie on the research organisations that generate and steward their own research data. These responsibilities are all the more pronounced for large-scale research data repositories that act as data stewards for multiple data contributors. Inter-organisational collectives and standardisation bodies can support these efforts in developing standard-form contractual clauses and contract templates that research organisations and research data repositories can adopt.
Data access committees
Last, research data repositories often appoint data stewards to assess the data access requests of external researchers, where data is not made publicly available. These data stewards are sometimes the principal investigators that established the concerned research repository. In other instances, such data stewards are specialised Data Access Committees (DACs) that include experts qualified to assess the ethical, scientific, and technical aspects of a proposed research project for the purposes of determining whether these are compatible with the data access policies of the host data repository. DACs are often made up of one or more individuals whose role is to oversee access to biomedical data, and who possess context-relevant expertise (in contrast to generalist Research Ethics Boards) (Shabani and Borry, 2016).
However, there is a lack of empirical evidence regarding DAC composition and the conditions DACs look for when determining whether or not data access requests should be granted. In fact, there are no procedural standards that apply across all DACs. Therefore, it is sometimes difficult to ascertain or benchmark what repositories deem acceptable, and the circumstances that cause data access requests to be rejected (Shabani et al., 2016). It can consequently be difficult for applicants to ensure that their data access requests have been reviewed according to fair and consistent criteria, and to prepare their applications so as to maximise their chances of obtaining access to data. Inconsistencies in review practices can create major impediments to the progress of research. Due to the time-limited nature of research grants, being refused access to data or engaging in time-consuming negotiations regarding data access can preclude research from being performed altogether (Devriendt et al., 2022).
The Data Access Committee Review Standards (DACReS) subgroup of the GA4GH's Regulatory and Ethics Work Stream (REWS) is investigating the harmonisation and standardisation of DAC procedures applicable to the review of data access requests. This group published their first formal recommendations in 2021. The document stipulates that standardising DAC processes can result in greater trust and mutual recognition. It can guarantee a more stringent standard of data protection whilst also providing more equitable access to research data. The GA4GH REWS outlines both guiding and procedural standards that could be followed to standardise this process. If implemented, these standards would facilitate standardisation and automation, as they ensure that the humans overseeing requests can address them in a systematic and internally consistent manner (Global Alliance for Genomics and Health, 2021).
Part 3: Towards automated biomedical data stewardship
The foregoing policies, templates, tools, and organisations support a succession of procedures that can be used to ensure that research data is used in a manner that is compatible with the applicable ethical and legal permissions throughout its life-cycle. These instruments incentivise data contributors, data repositories and data recipients to engage in common self-assessments and align themselves to shared standards that can increase interoperability in regulatory, ethical, and procedural obligations across research organisations. These practices ensure that the same data use conditions apply to the data throughout its life-cycle, and continue to be respected by downstream recipients of shared data.
However, for all of their advantages, these current approaches to biomedical data stewardship have numerous limitations. First, performing the foregoing analyses and devising such tools on a per-repository basis is cost-intensive, requiring considerable financial resources, time, and specialised human labour. Second, these procedures are susceptible to human error, in that the misapplication of a policy, misinterpretation of a contract, or erroneous interpretation of a proposed data use can lead to a breach of the concerned ethical, legal, and organisational rules. Third, and most critically, while the development and utilisation of these tools streamlines the ethical and legal oversight inherent in data ingestion and data access for a specified data repository, they do not facilitate the repurposing of data across multiple research data repositories. Further, these tools do not provide clear guidance to link each part of the process: from the informed consent process to the description of permissions in a dataset and its deposit to a research repository, to the adjudication of data access requests. This can lead to the emergence of profound incompatibilities in the conditions of data use applicable to data that is held in different research data repositories. It can also require cost-intensive verification from ethical, legal, and scientific experts to determine whether or not data can be repurposed, in one and the same research project, from across different research repositories. The prospect of incompatibilities and high costs is heightened for those groups that have not developed a suite of policies, templates, and tools fostering mutual compatibility of the conditions of data use applicable to their respective data holdings (Rahimzadeh et al., 2022).
In the prior parts, we have discussed the existing tools that help data repositories and networks of data repositories forestall this outcome through the adoption of common policies and processes. In this part, we detail the minimum required elements of a holistic policy infrastructure that would enable research repositories to ensure the compatibility of the governance conditions associated with their datasets without prior coordination, and help them confirm, through automated processes and tools, that the conditions of downstream data use applicable to multiple datasets are compatible.
The architecture of a system intended to automate the comparison of the ethical, legal, and institutional conditions of data use has been devised in great part. Though it is out of scope of this article, considerable progress has been made in developing technical standards to enable the proposed public policy proposals, including user authentication and the recording of user permissions relative to specified datasets (Voisin et al., 2021), standard file formats and ontologies to ensure data interoperability (Siu et al., 2016), and web-based data access request portals that enable DACs to operate in a cost-effective manner (Jensen et al., 2017). Platforms have also been created that leverage multiple such pre-existing standards to enable data to be queried and analysed across multiple organisations participating in a network (Freeberg et al., 2022; Dursi et al., 2021; Bujold et al., 2016). However, to date, no research data repositories have succeeded in automating the entire process of receiving a data access request, assessing it for compliance with the policies of the repository, and authorising access to the concerned data.
The following mechanisms would, in the view of the authors, enable the near-full automation of the process of receiving, reviewing, and responding to a data access request. These mechanisms require implementation through the different steps of: data generation and acquisition; contribution to a data repository; and data access oversight. The mechanisms would therefore require support from numerous different organisational stakeholders to succeed. The implementation of such a system would enable massive cost savings for biomedical research consortia in performing legal compliance processes. It would ensure that the data use conditions applicable to datasets from multiple organisations or multiple research consortia are compatible, even absent prior coordination or collaboration. To achieve this vision, the following measures would be required.
First, it is necessary for researchers intending to deposit data, and their organisations, to maintain clear records of the ethical and legal provenance of data. This would require them to keep copies of the informed consent forms, research protocols, and institutional policies that are applicable to the use of the data. Research organisations must also maintain an understanding of the legal rules applicable to the use of their data, which might arise from private law (e.g., contract) or from public law rules (e.g., data protection). These records should be kept electronically, to avoid reliance on paper-based systems and on the administrative capacity of humans. They should be easily accessible, time-stamped (i.e., versioned), and stored in a manner that fosters data protection and anonymity, whilst also ensuring clarity and transparency. However, the authors acknowledge that this may not be possible for research organisations that have limited access to the required technological resources.
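A minimal sketch of what such an electronic, versioned provenance record might look like. All field names here are illustrative assumptions, not a standard; a real system would align them with a community-agreed schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class ProvenanceRecord:
    """One versioned, time-stamped entry describing the ethico-legal
    provenance of a dataset. Field names are illustrative only."""
    dataset_id: str
    consent_form_ref: str   # pointer to the archived informed consent form
    protocol_ref: str       # pointer to the approved research protocol
    legal_basis: str        # e.g., "consent" or a public-law exemption
    version: int
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Successive versions are appended, never overwritten, so the provenance
# applicable at any past date can be reconstructed.
history = [
    ProvenanceRecord("ds-001", "consent/v1.pdf", "protocol/v1.pdf", "consent", 1),
    ProvenanceRecord("ds-001", "consent/v2.pdf", "protocol/v1.pdf", "consent", 2),
]
latest = max(history, key=lambda r: r.version)
assert latest.consent_form_ref == "consent/v2.pdf"
```

The design choice of append-only versioning is what allows a repository to demonstrate, after the fact, which rules governed a dataset at the time a given access decision was made.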
Second, data contributors and data repositories must collaborate to ensure that data ingested to repositories is subject to metadata ‘tags’ or ‘representations’ that capture the purposes for which such data can be utilised. For example, communities of prospective data contributors and of data repositories could align upon common ontologies that are used to describe the ethical and legal conditions applicable to data use. To ensure consistent application, these ontology terms would need to be supported in reliance on clear definitions against which prospective data depositors could compare the ethical and legal conditions of use applicable to their data.
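To illustrate, the metadata 'tags' described above can be reduced to a small machine-readable vocabulary against which proposed uses are checked. The term labels below (GRU, HMB, DS-CANCER, NCU) mimic consent-code style abbreviations but are assumptions for the sketch; a real deployment would adopt a shared ontology, such as the GA4GH Data Use Ontology, with clear definitions for each term:

```python
# Permission tags attached to each ingested dataset (illustrative only).
DATASET_TAGS = {
    "ds-001": {"permit": {"GRU"}, "restrict": set()},        # general research use
    "ds-002": {"permit": {"HMB"}, "restrict": {"NCU"}},      # health research, no commercial use
    "ds-003": {"permit": {"DS-CANCER"}, "restrict": set()},  # cancer research only
}

def use_is_permitted(dataset_id: str, proposed_use: str, commercial: bool) -> bool:
    """Return True if a proposed use matches the dataset's permission tags."""
    tags = DATASET_TAGS[dataset_id]
    if commercial and "NCU" in tags["restrict"]:
        return False
    # "GRU" permits any research purpose; otherwise the proposed use
    # must match a permitted term exactly.
    return "GRU" in tags["permit"] or proposed_use in tags["permit"]

assert use_is_permitted("ds-001", "DS-CANCER", commercial=True)
assert not use_is_permitted("ds-002", "HMB", commercial=True)   # NCU restriction
assert use_is_permitted("ds-003", "DS-CANCER", commercial=False)
```

The value of a shared vocabulary is that this comparison requires no legal interpretation at query time: the interpretive work is done once, at ingestion, when the depositor maps consent and policy documents onto the agreed terms.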
Third, data repositories would need to use contracts and record-keeping systems to bind data depositors to their representations regarding the ethical and legal conditions of use applicable to their contributed data, and to maintain records of the ontology terms that are applicable to each dataset ingested. These should be consistent, balance protection and data sharing, and be administratively un-burdensome.
Data depositors would also be responsible for maintaining accurate records of the conditions of data use applicable to each contributed dataset, and for communicating changes in the concerned permissions to the repositories. Such changes might arise if the ethical or legal rules applicable to the datasets were to change; for example, if national data protection laws were revised, or if an ethics waiver of the informed consent requirement enabled more liberal use of data. These record-keeping and contractual systems could be implemented using decentralised ledger technologies and smart contracts (i.e., the oft-discussed blockchain technologies that underpin the functioning of cryptocurrencies). While these could also be implemented using centralised electronic records and conventional contracts, the attractiveness of decentralised technologies resides in their ability to create public, directly auditable records offering transparency and ongoing control absent significant investment in the maintenance of a central custodial organisation (Kostick-Quenet et al., 2022; Mann et al., 2021; Drosatos and Kaldoudi, 2019). The proposed approach is intended to be technology-neutral; whether blockchain technologies or another mechanism are most appropriate to enable its implementation remains indeterminate at present.
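The core property sought here, append-only, tamper-evident records of permission changes, can be sketched without any blockchain machinery, simply by chaining each entry to a hash of the previous one. This is an illustrative simplification, not a substitute for a production ledger:

```python
import hashlib
import json

def append_entry(chain: list, change: dict) -> list:
    """Append a permission change, linking it to the hash of the prior entry."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"change": change, "prev": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    chain.append({**body, "hash": digest})
    return chain

def chain_is_valid(chain: list) -> bool:
    """Recompute every hash; any retroactive edit breaks the chain."""
    prev = "0" * 64
    for entry in chain:
        body = {"change": entry["change"], "prev": entry["prev"]}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != digest:
            return False
        prev = entry["hash"]
    return True

log = []
append_entry(log, {"dataset": "ds-002", "event": "consent waiver granted"})
append_entry(log, {"dataset": "ds-002", "event": "commercial use restriction lifted"})
assert chain_is_valid(log)
log[0]["change"]["event"] = "tampered"   # a retroactive edit...
assert not chain_is_valid(log)           # ...is detected on verification
```

Whether such a chain is maintained by a single custodian or replicated across a decentralised network is exactly the technology-neutral question left open in the text; the auditability property is the same in either case.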
Fourth, expert DACs (as opposed to REBs) would need to be retained to perform the elements of oversight that require human intervention. For example, applicants for data access must be authenticated to ensure that they have the requisite institutional affiliations and/or research experience to remain accountable in the use of controlled-access data. Systems to assign legal identities in an automated and decentralised fashion are being developed at a rapid pace, but have not yet seen widespread adoption. At present, these processes would still require human input in verifying the identities of researchers requesting access to data, and in confirming their fitness to perform research (e.g., confirmation of their expertise to perform academic research, and/or of their affiliation to a bona fide research organisation) (Shabani et al., 2016).
The verification of the scientific merit of proposed research efforts might also require case-specific human evaluation (e.g., assessing the scientific validity of a proposed research initiative) even as automated authentication systems gain traction. Applicable norms, including data protection legislation, national biomedical research ethics requirements, and liability rules, could require some minimal amount of human engagement in this process. However, it remains possible to verify data users just once, on a per-researcher or per-research project basis.
Once the human-initiated portions of the review have been performed, requests for data from multiple data repositories can be distributed to the distinct repositories for automated review. That is, personnel of a centralised data access portal could confirm the “authenticated” identities of applicants for data access once, and evaluate the scientific merit of the concerned research project once. The respective custodians of each concerned dataset could then compare these standard-form requests against the permissions and restrictions applicable to their respective datasets in an automated manner (Voisin et al., 2021).
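The fan-out step can be sketched as follows: the human-initiated checks happen once at a central portal, after which the same standard-form request is evaluated automatically against each repository's own permission records. All names and terms here are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class AccessRequest:
    applicant_verified: bool   # identity check, performed once by portal staff
    project_approved: bool     # scientific merit, assessed once by humans
    purpose: str               # expressed in the shared ontology terms

def review(request: AccessRequest, permitted_purposes: set) -> str:
    """One repository's automated review of a standard-form request."""
    if not (request.applicant_verified and request.project_approved):
        return "refer-to-human"   # human-initiated steps not yet complete
    return "grant" if request.purpose in permitted_purposes else "deny"

# Each repository holds its own permitted-purpose terms; the single
# vetted request is compared against all of them automatically.
repositories = {
    "repo-A": {"GRU", "DS-CANCER"},
    "repo-B": {"HMB"},
}
req = AccessRequest(applicant_verified=True, project_approved=True, purpose="DS-CANCER")
decisions = {name: review(req, terms) for name, terms in repositories.items()}
assert decisions == {"repo-A": "grant", "repo-B": "deny"}
```

The key efficiency is that the expensive, case-specific human judgments are performed once per researcher or project, while the per-dataset comparison, which would otherwise be duplicated across every repository, runs mechanically.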
Fifth, data recipients would receive access to the concerned data and would be bound to the terms of a DAA stipulating both (i) the conditions applicable to their use of the data and (ii) the purposes for which the data can be used, established in the applicable ontology terms. For these purposes, it is beneficial for different data repositories to further harmonise the language of their DAAs. This is critical: if each applicant for data access must understand and respect distinct contractual commitments for each dataset of distinct organisational provenance, the scalable use of data cannot be achieved, however scalable the data access procedures themselves become.
Conclusion
The literature on the standardisation and automation of ethico-legal oversight processes, and of organisational data stewardship practices, has highlighted numerous challenges. One common conclusion is that the automation of these practices is impracticable due to the qualitative and subjective framing of legal requirements, or the complexity and heterogeneity of applicable legal norms. Many are skeptical that any form of ethics review and governance oversight can be delegated to algorithms. Additionally, many of the tools that are often cited as enablers of automation, such as blockchain and smart contracts, are new and thus far subject to limited adoption, and are computationally intensive or otherwise cumbersome to implement. Our foregoing proposal challenges this paradigmatic view to some measure.
Organisations often devise internal policies and practices that distill their ethical, legal, and institutional commitments into actionable procedures, reflected in contractual commitments and policies. Inter-organisational arrangements are often negotiated to create stable practices across distinct jurisdictions. The principal challenge that exists in the context of biomedical data stewardship does not arise from inherent difficulties in navigating different national, local, and institutional laws, norms, and policies. Rather, the challenge arises from comparing jurisdiction-specific, organisation-specific and project-specific requirements in a cost-effective manner, and translating them into stable and transparent inter-organisational governance arrangements. We propose that many mechanisms can be developed throughout the data lifecycle to enhance ethical provenance by linking each stage of that lifecycle to the next.
Biomedical research organisations have implemented specialised mechanisms to develop repository-specific data stewardship arrangements in a cost-effective manner. Such mechanisms include institution-specific bodies, and personnel who are responsible for addressing the ethical and legal aspects of data contribution. These also include the creation of repository-specific tools to determine whether or not the contribution of data to the concerned biomedical data repositories complies with the relevant policies and requirements (these include the tools discussed in the foregoing sections). Here, partial standardisation and partial automation are already ongoing processes.
Ours is not the techno-utopian mantra that technological innovation will prevail where human efforts have failed. Rather, we contend that the processes of ‘standardisation’ and ‘automation’ are at present performed through the human-initiated processes of developing policies, template documents, standard operating procedures, and routine methodologies for the approval and review of data access requests. These routine processes can be further streamlined by using computational systems to formulate, record, and compare respective permissions in biomedical datasets. These can create further efficiencies in operationalising the upstream contribution of data to repositories, and the downstream administration of data access request procedures.
We do acknowledge the skepticism towards full automation, and understand that attempting to fully automate using electronic records of ethical provenance, as well as algorithms to enable the comparison of applicable conditions, may remain out of reach at present. Remaining challenges include the following. Though mechanisms have been developed to create biomedical data repositories and to support the cost-effective stewardship of their data, equivalent mechanisms do not exist to enable the cost-effective comparison of permissions for data that is held in different repositories. This means that whilst it is cost-effective for upstream researchers to contribute data to data repositories, and for downstream researchers to access data that is held in repositories in reliance on the applicable data access procedures, it is not cost-effective for researchers to scale their requests for data access across multiple repositories. One other practical limitation is that some organisations might not have the technological infrastructure required to adopt these proposed systems.
Therefore, the purpose of the foregoing organisational and technical proposals must be understood as follows. First, to enable data stewardship activities to operate in a scalable, cost-effective manner. This reduces the amount of specialised labour and effort required for researchers to request access to data, and for data stewards to evaluate and return responses to such requests. Efficiencies are gained relative to the time and resources required for upstream research data contributors to draft and interpret data sharing policies, for researchers to formulate and submit data access requests, and for data stewards to adjudicate such requests. Second, to enable the submission of data access requests to multiple repositories through a singular data access request procedure, rather than requiring the duplication of data access request procedures and of oversight efforts. The holistic standardisation and automation of data access and oversight mechanisms is not required to achieve enhanced efficiencies in this respect. The partial standardisation and automation proposed will facilitate data access and data oversight, enhancing rapid and cost-effective researcher access to existing datasets for future research use.
The existing tools described above enable most of the required activities to be performed. Yet, most organisations have not implemented organisational processes to electronically record the ethical and legal ‘provenance’ of their data using standardised ontologies. Procedural guidelines should therefore be developed and adopted enabling distinct organisations to perform these activities in a uniform manner. Moreover, insufficient efforts have been made to standardise and streamline the DAAs that repositories impose on researchers and their research organisations as a precondition to providing access to biomedical data. There is also no singular web platform or software tool that enables users to record and to compare information relating to the ethical provenance of data in a meaningfully comparable manner, across different research organisations. Our proposal could help translate present, disparate efforts to align organisational data governance practices into a holistic practice of systematised recording and comparison thereof. Such progress is imperative to ensure the accountability of scientific research organisations, to maximise the benefits of scientific research for patients and publics world-wide, and to unlock presently siloed data to enable future research efforts. After all, was that not the original purpose of data collection?
To conclude, the principal contribution of our paper is the following. We propose that organisation-specific and project-specific data stewardship mechanisms should not be assessed in a self-contained manner. Ensuring the interoperability of data stewardship rules and processes across multiple organisations and repositories is a prerequisite to enabling research participants, researchers, and publics to contribute to, and reap benefit from, a shared international biomedical data commons. Creating purely local, bespoke stewardship practices for specified data holdings might sometimes be appropriate for vulnerable populations or groups that have a strong claim to the sovereign oversight of their data. However, if organisations and repositories develop bespoke data stewardship practices as a default proposition, this choice ceases to be value-neutral. It heightens the transaction costs and administrative costs inherent in data use, and withdraws such data from the international scientific commons. This denies local data contributors, and local data users, equitable participation in, and benefit from, international biomedical data flows. Organisations and data repositories acting as data stewards therefore have an ethical duty to consider the adoption of interoperable, and permissive, data stewardship rules as a standard operational practice. The pervasive, pan-organisational choice not to adopt interoperable data stewardship practices is a structural problem. It risks the balkanisation of flourishing open science communities, and the relegation of their shared resources to silos.
Footnotes
Acknowledgements
AB, BMK, and DZ acknowledge the funding support of the grant entitled ‘EuCanSHare: an EU-Canada joint infrastructure for next-generation multi-Study Heart research’/Fonds de Recherche du Québec - Santé 278114/Canadian Institutes of Health Research 16033/European Commission (2018–2024). MRA acknowledges the funding support of NIH NHGRI grant 1U24HG011025-01A1.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Fonds de Recherche du Québec - Santé, European Commission, Canadian Institutes of Health Research (grant number 278144, 825903 , 16033).
