Abstract
Objective
This study aims to conduct a bibliometric analysis of literature related to data governance and open sharing in the fields of life sciences and medicine, so as to clarify the characteristics of publications and explore research hotspots and trends.
Methods
A total of 2529 valid documents published in the Web of Science Core Collection database from 2000 to 2024 were included in this study. VOSviewer was used for co-occurrence analysis, while CiteSpace was employed for clustering, burst detection, and thematic evolution analysis.
Results
Between 2000 and 2024, the number of studies on data governance and open sharing in the fields of life sciences and medicine has increased annually, indicating the growing importance of research in this area. The USA led as the country with the most research output in this field. The University of Oxford was the institution with the highest publication volume, Amy L. McGuire was the most active author, and the
Conclusions
Topics such as the FAIR principles, ethical issues, public attitudes toward data sharing, data quality, databases, and big data analysis techniques are hotspots in this field. Potential research frontiers include the FAIR principles, data quality, public trust and attitudes toward data sharing, the application of artificial intelligence technology in data governance and sharing, governance and sharing of epidemiological and public health data, governance and sharing of data on chronic diseases such as diabetes, and the construction of data governance models.
Keywords
Synonyms
Introduction
In November 2021, the 41st session of the General Conference of the United Nations Educational, Scientific and Cultural Organization adopted the ‘Recommendation on Open Science’, marking a new phase in the global consensus on open science. 1 As a key element of open science, the open sharing of scientific data has become a focal point of attention for countries around the world. Current scientific research is moving toward a data-intensive, data-driven, and data-sharing direction, which poses higher demands on the volume and quality of open scientific data sharing. It also brings a series of issues and challenges, such as the ownership of data rights and the risks associated with data sharing. The FAIR principles, which stand for Findable, Accessible, Interoperable, and Reusable, have established the basic guidelines for data governance. However, how to more scientifically and normatively promote data sharing and governance still requires in-depth research and discussion.
Since the initiation of the Human Genome Project, omics technologies represented by next-generation sequencing and mass spectrometry have advanced rapidly. This has led to an exponential increase in vast amounts of life science omics data, including genomics, transcriptomics, epigenomics, proteomics, and metabolomics. 2 The fields of life sciences and medicine are experiencing a profound transformation toward a data-intensive fourth paradigm of science. The data in the field of life science and medicine are characterized by enormous scale, wide variety, complex structure, and uneven quality, which makes it difficult to achieve high-dimensional and multi-level integration and sharing, thus obscuring the potential high value of scientific data. Furthermore, there is ambiguity in the ownership of individual-level health data, 3 and there are risks associated with human genetic resource-related data sharing. 4 Both nations and citizens remain skeptical about whether to share such data, which also calls for a more compatible approach to data governance. Additionally, there is a significant amount of data concealment behavior in the field of life sciences and medicine. Although data sharing among researchers can create significant public value, the potential loss of scientific leadership and economic benefits largely hinders the sharing of valuable data. 5
Despite the various challenges facing scientific data governance and open sharing, the scientific field is transitioning from a traditional closed research paradigm to a comprehensive open science paradigm, and the sharing of scientific data is an irreversible trend. In the 1980s and 1990s, the USA, Europe, and Japan each established one of the world's three major biological data centers: the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ). These three data centers have greatly facilitated life science and biomedical research by sharing data resources submitted by third parties. In addition, The Cancer Genome Atlas (TCGA) database, established by the National Cancer Institute in the USA, and the national cohort UK Biobank (UKB) in the UK have implemented tiered sharing of data produced by large-scale research projects. Moreover, numerous databases and knowledge bases that provide services for data query, browsing, download, and analysis have been established by small and medium-sized research teams. 2
In an effort to encourage more extensive data sharing, numerous data sharing practices have been implemented by countries worldwide. The National Institutes of Health (NIH) began requiring data-sharing plans in 2003 for grant applications with an estimated annual cost of over $500,000. Other funding agencies and organizations, including the National Science Foundation, the Howard Hughes Medical Institute, and the Wellcome Trust, have followed suit. 5 The European Union's Directive on Open Data and the Re-use of Public Sector Information obliges member states to implement open access policies for research data generated by public funding. 1 In China, the Scientific Data Management Policy stipulates that scientific data generated from government budgetary funds should be made available for sharing with society and relevant departments as a norm, with non-sharing being the exception. 1 On the legal front, the General Data Protection Regulation (GDPR) of the European Union, which came into effect on 25 May 2018, has established rules for the protection of individuals with regard to the processing and the free movement of personal data, significantly impacting the fields of data governance and open sharing. Subsequently, the Data Governance Act, effective as of 23 June 2022, with its scope covering both personal and non-personal data, aims to promote data sharing across sectors and EU countries to harness the potential of data for the benefit of European citizens and businesses. 6
As scientific data gradually evolve into a factor of production, research on data governance and open sharing is increasing steadily. As a significant domain for data generation, the field of life sciences and medicine has developed its own body of research on data governance and open sharing. However, the current state of data governance and open sharing in the life sciences and medical fields, the contributions of various entities, the evolution of research themes, the issues of greatest concern to scholars, and the future direction of this research field have not yet been systematically mapped. To date, no comprehensive analysis of the publication status, thematic hotspots, and evolutionary trends within this field has been identified, making it necessary to summarize the current state of affairs.
Bibliometric research provides an evidence-based quantitative analysis model to understand the knowledge structure, collaboration, and frontiers of research areas. It can reveal the collaboration between countries, institutions, and authors, perform citation counts and co-citation analysis, and analyze research hotspots and trends through keyword co-occurrence and burst detection.7–9 Bibliometric analysis can assist scholars in understanding the current state and trends of research in a particular field, providing references and insights for future research directions, and has been widely applied across various research areas. For instance, it has been used to study the global trends related to acute kidney injury in COVID-19, 10 to explore the current state, research progress, and prospects of artificial intelligence applications in wastewater treatment, 11 and to investigate the relationship between financial development and economic growth. 12 In the realm of data-related fields, Lee and Syn conducted a bibliometric analysis of the global research trends in research data management, 13 and Pradhan and Zala performed a comparative bibliometric analysis of global research literature on research data management in the Scopus and Web of Science databases from 2000 to 2019. 14 This study applies bibliometric methods to the analysis of literature related to data governance and open sharing in the life sciences and medical fields, with the aim of clarifying the collaboration, research status, and thematic evolution, as well as providing references and insights for future research in this area.
CiteSpace and VOSviewer are two Java-based information visualization software tools. This study utilizes these two tools in combination to conduct a comprehensive bibliometric analysis of publications related to data governance and open sharing in the field of life sciences and medicine, in order to explore the characteristics of publications, thematic hotspots, evolutionary trends, and future research directions in this domain.
This study analyzes the publication quantity, co-occurrence, clustering, and burst detection from the perspectives of entities such as countries, institutions, authors, journals, references, and keywords. The research objectives of this study primarily encompass the following three aspects:
(1) Identifying the leading countries, institutions, and authors in the field of data governance and open sharing in life sciences and medicine, as well as their collaboration situation; (2) exploring the evolution and development of research themes within this field; (3) investigating the hotspots and forecasting the frontiers of research in the field.
Methods
Data source and literature search strategy
The Web of Science is the most comprehensive academic information resource globally, covering the largest number of disciplines, with over 12,000 core academic journals included. It is frequently used by researchers and widely recognized as a reliable and comprehensive source of academic information, making it the preferred choice for conducting bibliometric analysis.8,9 We found that using ‘Topic’ for the search yielded a large number of irrelevant and redundant documents, so this study employs ‘Title’ for literature retrieval, which not only ensures a sufficient sample size of documents but also guarantees the precision of the search. Additionally, we utilized the MeSH thesaurus of the PubMed database to search for synonyms of ‘data sharing’ and ‘data governance’, and ultimately determined the search strategy through screening. Moreover, ‘data governance’ is a term that has emerged in recent years, with ‘data management’ being more commonly used in the past. To fully present the evolution of the subject, this study included both ‘data governance’ and ‘data management’ in the search strategy. To ensure the timeliness of the research, this study retrieved literature from the Web of Science Core Collection database from 1 January 2000 to 24 March 2024. The search term was: TI = (‘Open Data Sharing’ OR ‘Data Openness’ OR ‘Data Sharing’ OR ‘Data Governance’ OR ‘Data Management’). To facilitate further analysis of the literature content, the search was limited to documents in ‘English’ and the document type was specified as ‘article’. The research areas were filtered to encompass disciplines within the life sciences and medical fields. The retrieved records were then imported into NoteExpress for deduplication, resulting in a final set of 2529 valid documents. The obtained documents were exported in ‘plain text file’ format, including full records and cited references.
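The retrieval and deduplication steps described above can be sketched in a few lines of Python. This is only an illustration: the study itself ran the search in the Web of Science interface and deduplicated in NoteExpress, whose matching rules are not described. The record keys `title` and `year`, and the rule of matching on a normalized title plus year, are assumptions made for the example.

```python
import re

# Title phrases taken from the search strategy described in the text.
TERMS = ["Open Data Sharing", "Data Openness", "Data Sharing",
         "Data Governance", "Data Management"]

def build_title_query(terms):
    """Join quoted phrases with OR inside a TI= (title) field tag."""
    quoted = " OR ".join(f'"{t}"' for t in terms)
    return f"TI=({quoted})"

def normalize_title(title):
    """Lowercase and collapse punctuation and whitespace so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def deduplicate(records):
    """Keep the first record for each (normalized title, year) key.

    `records` is a list of dicts with the hypothetical keys 'title' and 'year'.
    """
    seen = set()
    unique = []
    for rec in records:
        key = (normalize_title(rec["title"]), rec["year"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Matching on title and year alone is deliberately conservative; a real pipeline might also compare DOIs or author lists before discarding a record.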
Software for bibliometric analysis
This study primarily utilized CiteSpace 5.5.R2 and VOSviewer 1.6.18 as the tools for bibliometric analysis, with Excel and Scimago Graphica software used for visualization. Figure 1 presents the literature retrieval strategy, inclusion and exclusion criteria, and the analytical approach. VOSviewer was predominantly used for co-occurrence analysis of countries, institutions, authors, and keywords, while CiteSpace was mainly applied for clustering and burst detection analysis. Excel was employed for frequency statistics, and Scimago Graphica was used for generating geographical visualization maps.

Figure 1. Flowchart of the literature retrieval strategy, inclusion and exclusion criteria, and the analytical approach.
Results
Analysis of annual publication volume
Figure 2 illustrates the distribution of annual publication volumes related to data governance and open sharing in the fields of life sciences and medicine from the Web of Science Core Collection (WoSCC) database between 2000 and 2024, along with the fit of an exponential growth model. Since 2000, the annual number of publications in this field has shown a fluctuating upward trend, peaking in 2021, after which there was a slight decline. An exponential growth function was used to evaluate the relationship between the annual publication volume and the year of publication. The results showed that the model was in good agreement with the observed trend in publication volumes (

Figure 2. Annual publication volume and the exponential function predictive model.
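For readers who want to reproduce this kind of trend check, the sketch below fits an exponential curve to yearly counts and reports a coefficient of determination. The text does not state how the model was fitted, so the choices here are assumptions: the curve has the form a·exp(b·(year − first year)), it is fitted by ordinary least squares on the logarithm of the counts, and the goodness of fit is computed on the original scale. The function names are hypothetical.

```python
import math

def fit_exponential(years, counts):
    """Fit counts ~ a * exp(b * (year - years[0])) via least squares on log(counts).

    All counts must be positive, because the fit is done on their logarithms.
    """
    x = [y - years[0] for y in years]
    z = [math.log(c) for c in counts]
    n = len(x)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxz = sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z))
    b = sxz / sxx                      # growth rate per year
    a = math.exp(mean_z - b * mean_x)  # fitted value in the first year
    return a, b

def r_squared(years, counts, a, b):
    """Coefficient of determination of the fitted curve on the original scale."""
    pred = [a * math.exp(b * (y - years[0])) for y in years]
    mean_c = sum(counts) / len(counts)
    ss_res = sum((c - p) ** 2 for c, p in zip(counts, pred))
    ss_tot = sum((c - mean_c) ** 2 for c in counts)
    return 1 - ss_res / ss_tot
```

A log-linear fit gives more weight to the early, low-count years than a direct nonlinear fit would, so the two approaches can yield somewhat different parameters on real data.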
Analysis of national publications
This study revealed the contributions of and collaboration among countries/regions by counting publications and conducting geographical visualization analysis. The top 10 countries/regions by publication volume and the collaboration map are depicted in Figure 3. According to the analysis of the included literature, between 2000 and 2024, a total of 133 countries/regions participated in the publication of documents in this field, forming 15 clustered networks. The USA ranked first with 1041 publications, significantly higher than the UK in second place (435) and Germany in third (250). The publication volumes of the other countries in the top 10 were relatively close, and none of the countries/regions not included in Figure 3a exceeded 100 documents. Because multinational collaboration is common in scientific research, collaborative papers are counted more than once, so the sum of the publication volumes of the countries is greater than the total number of articles included. This study counted every country involved in the publication of a given document, with the primary statistical indicator being the absolute number of publications to which each country contributed. The same approach was applied to the statistics for institutions and authors.
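The counting rule described above, in which every participating country receives one full credit per paper, can be illustrated with a short Python sketch. The input format (one list of country names per paper) and the toy values are assumptions for the example and are not data from the study.

```python
from collections import Counter

def full_count(papers):
    """Credit each country once for every paper it appears on (full counting)."""
    counts = Counter()
    for countries in papers:
        counts.update(set(countries))  # a country counts once per paper
    return counts

# Toy input: three papers, two of them multinational.
papers = [["USA", "UK"], ["USA"], ["Germany", "USA", "UK"]]
counts = full_count(papers)
total_credits = sum(counts.values())
```

With this toy input the three papers produce six country credits, which shows why the per-country totals add up to more than the number of articles.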

Figure 3. (a) Top 10 countries in publication counts. (b) Country collaboration network (a node represents a country, the links between nodes represent their collaboration relationships, different colors of nodes and links represent different research clusters).
The study combined VOSviewer and Scimago Graphica software to select the top 30 countries/regions by publication volume and mapped out an international collaboration network, which was then displayed using geographical visualization charts to depict the state of international collaboration more clearly. In Figure 3b, the color of the nodes corresponds to the countries/regions they represent, the size of the nodes indicates the volume of publications, and the darkness and thickness of the lines represent the strength of the collaboration, with the same color signifying a cluster. The most frequent collaborators were the UK (frequency = 938), the USA (frequency = 917), and Germany (frequency = 632). Notable strong collaborative relationships were observed between the USA and the UK, the USA and Canada, the UK and Germany, and the USA and Germany. Representative clustered collaboration networks include: (1) the USA, Canada, China, Australia, India, etc.; (2) the UK, Brazil, and Sweden, etc.; (3) Germany, France, Italy, Belgium, Denmark, Norway, etc. In the research on data governance and open sharing in the fields of life sciences and medicine, the USA, the UK, and Germany have made significant contributions and have engaged in frequent collaborations.
Analysis of institution publications
A visualization analysis of the institutions involved in publications revealed that between 2000 and 2024, a total of 4629 institutions participated in the research on data governance and open sharing in the fields of life sciences and medicine. Figure 4 displays the top 15 institutions by publication volume and the institutional collaboration network. The University of Oxford in the UK had the highest publication volume, followed by the University of Washington, Harvard Medical School, Duke University, and Stanford University in the USA. There were 124 institutions with 10 or more publications, which formed seven clusters. The main clusters included: (1) a red cluster predominantly consisting of Harvard University, Harvard Medical School, and Duke University from the USA; (2) a green cluster predominantly consisting of the University of Washington, University of California, San Francisco, and University of California, San Diego from the USA; (3) a blue cluster predominantly consisting of the University of Toronto and McGill University from Canada; (4) a yellow cluster predominantly consisting of Johns Hopkins University from the USA; (5) a purple cluster predominantly consisting of the University of Oxford, University College London, and the University of Manchester from the UK. It is evident that in the research on data governance and open sharing in the fields of life sciences and medicine, research groups led by universities from the USA, the UK, and Canada were highly engaged and frequently collaborated.

Figure 4. (a) Top 15 institutions in publication counts. (b) Institution collaboration network (a node represents an institution, the links between nodes represent their collaboration relationships, different colors of nodes represent different research clusters).
Analysis of author publications
A statistical analysis of author publication records revealed that between 2000 and 2024, a total of 14,484 authors participated in the publication of articles on data governance and open sharing in the fields of life sciences and medicine. Figure 5a presents the top 10 authors by publication volume, with Amy L. McGuire leading the list with 14 publications, followed by Lucila Ohno-Machado with 13 publications. There were 201 authors with three or more publications, who formed 57 clusters. Figure 5b illustrates the collaboration and clustering among these 201 authors. In the figure, the size of the circles represents the number of publications, the darkness of the lines indicates the strength of collaboration, and the colors represent different clusters. It is observable that research on data governance and open sharing in the fields of life sciences and medicine is often conducted in the form of small teams or individual work, with only a few teams showing intersecting collaborations. The majority of the work exists in the form of independent research by individual teams. This suggests that there is a need to enhance collaboration among individuals and between teams within the field, in pursuit of broader and larger-scale research partnerships.

Figure 5. (a) Top 10 authors in publication counts. (b) Author collaboration network (a node represents an author, the links between nodes represent their collaboration relationships, different colors of nodes represent different research clusters).
Analysis of journal publications
This study included literature from 1022 journals. Table 1 lists the top 10 journals by publication volume, along with their publication counts, countries of origin, five-year impact factors, and JCR categories. These journals are predominantly published in the UK and the USA, and all of them fall in the Q1 or Q2 quartiles of the Journal Citation Reports. The journals with the most publications in the field of data governance and open sharing in life sciences and medicine are the
Table 1. Top 10 journals in the field of data governance and open sharing in life sciences and medicine.
Analysis of references
Analysis of most cited references
The 2529 articles included in this study cited a total of 71,842 references. Table 2 presents the top 10 references by citation frequency. The most frequently cited document was the Comment published in Scientific Data titled ‘The FAIR Guiding Principles for Scientific Data Management and Stewardship’. 15 This Comment noted that a workshop named ‘Jointly Designing a Data Fairport’ was held in Leiden, the Netherlands, in 2014, where experts collaboratively drafted the FAIR principles. After refinement and improvement by the FAIR Working Group, this Comment officially released the FAIR principles for the first time. Among the remaining papers, two addressed the identification of individuals through genetic information,16,17 three collected the attitudes of the public or participants toward personal data sharing,18–20 one discussed the attitudes of scientists toward data sharing, 21 and the remaining three covered the following themes: how to share data more responsibly, 22 advocacy for data sharing to improve public health, 23 and GlaxoSmithKline's decision to share clinical trial data. 24
Table 2. Top 10 cited publications in the field of data governance and open sharing in life sciences and medicine.
Analysis of citation bursts
A citation burst occurs when a publication's citation count is significantly higher than usual for a period of at least two years; bursts can be used to explore emerging hotspots and research frontiers in a field of study. 25
The blue line in Figure 6 represents the observation period from 1990 to 2022, while the red line indicates the burst time of cited documents. The publication with the highest burst value between 2000 and 2024 was ‘Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays’, published in

Figure 6. Top 18 references with the strongest citation bursts (blue line represents the time intervals when cited literature appears, red bars indicate that the number of citations of the literature suddenly increased during that period).
Analysis of keywords
Frequency analysis of keywords
Keywords were extracted and their frequencies were counted for literature on data governance and open sharing in the fields of life sciences and medicine published between 2000 and 2024. Figure 7 lists the top 10 keywords by frequency. Among these keywords, ‘data sharing’ appeared 261 times, ranking first; ‘privacy’ ranked third; ‘data management’ ranked sixth; and ‘information’ and ‘quality’ ranked ninth and tenth, respectively. ‘Care’, ‘mortality’, ‘risk’, ‘outcomes’ and ‘health’ were important characteristic terms in the fields of life sciences and medicine. These rankings suggest that, in the context of data governance and open sharing, privacy has received a high level of attention and that researchers are also very concerned about the quality of data.

Figure 7. Top 10 most frequent keywords.
Co-occurrence analysis of keywords
Figure 8 presents a co-occurrence network constructed from high-frequency keywords, which can predominantly be categorized into three modules: the blue module focused on ‘data sharing’, the green module focused on ‘data management’, and the purple module centered around characteristic terms of the life sciences and medicine fields such as ‘care’ and ‘health’. Within the data sharing module, keywords with high co-occurrence intensity included ‘privacy’, ‘ethics’, ‘consent’, ‘attitudes’ and ‘trust’. This indicates that privacy protection, informed consent, public attitudes, and trust toward data sharing, along with other ethical issues, are key factors affecting data sharing and are among the most closely watched topics by researchers in this area. In the data management module, keywords with high co-occurrence intensity included ‘big data’, ‘database’, ‘machine learning’, ‘cancer’ and ‘standards’. In the fields of life sciences and medicine, there exists a vast amount of big data that is distinguished by its volume, velocity, variety, and variability. Discovering and extracting valuable scientific data from big data, utilizing databases for storage, management, and application, and applying technologies such as machine learning for data analysis are key to data governance. The strong association between the keyword ‘cancer’ and the data management module suggests that cancer may be a disease of particular concern in the context of data governance. Additionally, ensuring that data conform to standard specifications is also a hot topic in data governance. In the life sciences and medicine module, keywords with high co-occurrence intensity included ‘care’, ‘health’, ‘outcomes’, ‘mortality’ and ‘risk’, indicating that data governance and open sharing in the field of life sciences and medicine primarily serve the health and well-being of individuals.

Figure 8. Keywords co-occurrence network (a node represents a keyword, the links between nodes represent the co-occurrence relationships between keywords, and the different colors of nodes and links indicate different thematic clusters).
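The idea behind a keyword co-occurrence network can be shown with a minimal Python sketch: two keywords are linked each time they appear on the same document, and the link weight is the number of such documents. This is a simplified illustration and not VOSviewer's implementation; the lowercasing step and the function name are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(keyword_lists):
    """Count, for each unordered keyword pair, the documents listing both keywords."""
    pairs = Counter()
    for kws in keyword_lists:
        # Normalize case and drop duplicates within one document.
        unique = sorted({k.strip().lower() for k in kws})
        pairs.update(combinations(unique, 2))
    return pairs
```

The resulting pair counts can serve as edge weights in a graph, on which a clustering step then produces modules like those described in the text.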
Analysis of the research themes evolution
Keyword time zone maps reflect the evolutionary trends of research topics over time, revealing the updating of knowledge within research fields and their mutual influences. 27 The keyword time zone map, as shown in Figure 9, was divided into two-year intervals, where the size of the nodes represented the frequency of keyword co-occurrence, and the lines represented the relationships of co-occurrence between keywords. In the field of life sciences and medicine, the themes of focus from 2000 to 2009 included ‘therapy’, ‘mortality’, ‘prevalence’, ‘care’, ‘disease’ and ‘outcome’, which primarily emphasized passive treatment approaches to outcomes such as illness and death. From 2010 to 2019, the themes of focus shifted to ‘health’, ‘diagnosis’, ‘epidemiology’, ‘cancer’, ‘electronic health record’ and ‘genomics’, indicating a proactive pursuit of health from the entry points of chronic diseases, electronic health records, and genomics. Since 2020, the themes of ‘public health’ and ‘COVID-19’ have come to the forefront, mainly under the influence of the COVID-19 pandemic, which has led to increased attention to public health issues such as pandemics. At the level of research content in data governance and open sharing, the primary focus from 2000 to 2007 was on ‘database’, ‘data management system’, and ‘bioinformatics’ related to data management. From 2008 to 2021, the focus shifted toward content related to data sharing, including ‘privacy’, ‘ethics’, ‘consent’, ‘attitude’, ‘trust’, and ‘safety’. Since 2022, the primary focus has been on content related to data quality and data governance. At the technological level, ‘big data’ emerged in 2014, ‘blockchain’ in 2018, ‘artificial intelligence’ in 2020, and ‘machine learning’ in 2022.

Figure 9. Time zone view of keywords (taking two years as a time slice, the nodes within the slice represent the high-frequency keywords of that period, and the links between nodes represent the co-occurrence relationships between keywords).
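The two-year slicing behind a time zone view can be approximated with the Python sketch below, which places each keyword in the slice containing the year it first appears. This is an assumption-laden simplification and not CiteSpace's code: the input format (pairs of year and keyword list), the function name, and the lowercasing rule are all hypothetical.

```python
def first_appearance_slices(records, start_year, width=2):
    """Group keywords by the time slice in which they first appear.

    `records` is an iterable of (year, keywords) pairs; slices start at
    `start_year` and each spans `width` years.
    """
    first = {}
    for year, keywords in records:
        for kw in keywords:
            k = kw.lower()
            if k not in first or year < first[k]:
                first[k] = year
    slices = {}
    for k, y in first.items():
        # Start year of the slice that contains y.
        lo = start_year + ((y - start_year) // width) * width
        slices.setdefault(lo, []).append(k)
    return {lo: sorted(v) for lo, v in sorted(slices.items())}
```

Each key of the returned dictionary is the first year of a slice, so with a start year of 2000 and a width of two, keywords first seen in 2014 or 2015 share one slice.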
Clustering analysis of keywords
After the literature records were imported into CiteSpace, eight clusters were formed, as depicted in Figure 10. These clusters are ‘ethics’, ‘data management’, ‘outliers’, ‘data sharing’, ‘clinical trial’, ‘prevalence’, ‘biopsy’ and ‘acute coronary syndrome’. The ‘ethics’ cluster indicates that research on data governance and open sharing in life sciences and medicine is primarily concerned with related ethical issues. The ‘outliers’ cluster suggests that outliers in a dataset may affect data quality and could potentially aid in identifying individuals, leading to the disclosure of personal privacy. The emergence of the ‘clinical trial’, ‘prevalence’, ‘biopsy’ and ‘acute coronary syndrome’ clusters indicates that data from clinical trials, epidemiological studies, cancer and tumors, and cardiovascular diseases may receive higher attention in data governance and open sharing due to their large volume, importance, and particularity.

Figure 10. Keywords clustering map (each color block represents a thematic cluster generated by keywords).
Citation burst analysis of keywords
Figure 11 presents the top 19 keywords with the highest burst intensity from 2000 to 2024. The keyword with the highest burst value was ‘COVID-19’, which emerged in 2021 and has continued to the present, indicating that the sharing and governance of data related to the novel coronavirus are important for mitigating the spread of the COVID-19 pandemic. ‘Data management’ burst between 2001 and 2004, while ‘data governance’ burst from 2022 to 2024. Data management focuses on the technical processing and operation of data, whereas data governance emphasizes strategic management and oversight. The evolution from data management to data governance reflects a shift in data's role from an initial tool for recording and storage to a critical asset and strategic resource in scientific research and social development. The evolution of burst terms across different time periods reveals that the construction of databases was a research focus during 2002–2010. From 2005 to 2011, researchers began to pay attention to the design and optimization of data management systems. Between 2015 and 2017, the sharing and management of clinical trial data became a research hotspot. During 2016–2019, the sharing and management of data related to quality of life received widespread attention. Data security issues were a significant concern from 2018 to 2022, with the application of blockchain and artificial intelligence in data sharing and governance becoming a hot topic. Keywords that burst until 2024 included ‘epidemiology’, ‘adult’, ‘attitude’, ‘public health’, ‘obesity’, ‘trust’, ‘covid-19’, ‘model’, ‘health’ and ‘data governance’. This suggests that epidemiology and public health, public trust and attitudes toward data sharing, sharing and governance of data related to chronic diseases such as diabetes, and the construction of data governance models may become future research hotspots and frontiers.

Figure 11. Top 19 keywords with the strongest citation bursts (blue line represents the time intervals when keywords appear, and red bars indicate the periods of a surge in citation volume).
Discussion
Overview of publications in the research area
During the period from 2000 to 2024, the annual number of publications on data governance and open sharing in the fields of life sciences and medicine has shown an overall trend of fluctuating growth. The number of publications surpassed 100 articles in 2014 and reached a peak in 2021 before experiencing a slight decline. The overall increasing trend in publication volume can be attributed to several potential reasons: (1) With the rapid development of omics technologies and information technology, the amount of data in life sciences and medicine has undergone an explosive growth, leading to a profound transformation toward a data-intensive fourth scientific paradigm. (2) The application of big data and artificial intelligence technologies has made the analysis and processing of massive data possible and convenient, 28 thereby promoting research on data governance and open sharing. (3) The rise of the open science movement has facilitated data sharing, with an increasing number of datasets becoming openly accessible, enhancing data transparency and availability. 29 (4) The growing attention to health and the increasing emphasis on life sciences and medical research worldwide have led to greater investment of resources by governments and research institutions, which has in turn stimulated research and publication in these areas. The annual number of published papers fell slightly after reaching a peak in 2021, which may be due to the transfer of research hotspots and fluctuations in research funds. It is also possible that some issues have impacted the progress and publication of research in the field, including data privacy protection, data classification and standardization, human genetic resource data sharing risk, and so on. Additionally, the trend in annual publication volume correlates highly with the exponential growth function prediction model, suggesting that there is a certain regularity and predictability to the growth of research in this field. 
This strong correlation may imply that the importance of data governance and open sharing will be further recognized and accepted over time, and that the open sharing and scientific governance of data is an irreversible trend.
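The kind of exponential growth fit mentioned above can be sketched as follows. This is a minimal illustration, not the fitting procedure used in this study: it assumes a model of the form y = a·exp(b·(t − t0)) and estimates a and b by ordinary least squares on the logarithm of the counts, which is only one of several possible fitting methods. The function names `fit_exponential` and `r_squared` are hypothetical, and the `counts` series is synthetic, generated purely for demonstration; it is not the publication data analyzed here.

```python
import math


def fit_exponential(years, counts):
    """Fit counts ~ a * exp(b * (year - years[0])) by least squares on log(counts).

    Assumption: every count is strictly positive, so the logarithm is defined.
    """
    xs = [y - years[0] for y in years]
    logs = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    b = sxl / sxx                          # growth rate per year
    a = math.exp(mean_log - b * mean_x)    # fitted value in the first year
    return a, b


def r_squared(years, counts, a, b):
    """Coefficient of determination of the fitted curve on the original scale."""
    predicted = [a * math.exp(b * (y - years[0])) for y in years]
    mean_count = sum(counts) / len(counts)
    ss_res = sum((c - p) ** 2 for c, p in zip(counts, predicted))
    ss_tot = sum((c - mean_count) ** 2 for c in counts)
    return 1.0 - ss_res / ss_tot


# Synthetic, illustrative series only (20% growth per year); NOT the study's data.
years = list(range(2000, 2010))
counts = [10 * 1.2 ** i for i in range(len(years))]

a, b = fit_exponential(years, counts)
fit_quality = r_squared(years, counts, a, b)
```

With real annual counts, a value of `fit_quality` close to 1 would correspond to the close agreement with the exponential model described above. A post-peak decline such as the one observed after 2021 would lower it.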
In terms of national publication output, the USA, the UK, Germany and Canada lead in publication volume and are also the most frequent collaborators with other nations. Eight of the top 10 countries by publication volume are located in the Americas and Europe. Among the top 15 institutions in terms of publication output, nine are based in the USA, three in the UK, two in Canada, and one in Australia. Large-scale institutional clusters are primarily led by universities in the USA, the UK and Canada. Although Germany ranks third in publication output, none of its research institutions ranks among the top 15 by publication volume, and no large-scale cluster dominated by German institutions has formed. In terms of author publication output, four of the top 10 authors are from the USA, three from the UK, two from Canada, and one from Australia, which is similar to the distribution of institutions. In the field of data governance and open sharing in life sciences and medicine, the USA therefore holds a leading position, with other high-contributing countries mainly located in the Americas and Europe. This may be related to their higher levels of economic development and investment in healthcare. Additionally, these countries have established a number of comprehensive data centers to facilitate the open sharing of life sciences and medical data, such as the NCBI, the EBI, and the Swiss Institute of Bioinformatics (SIB). They have developed relatively mature management systems and have fostered a robust data ecosystem.
The collaboration between countries shows a pattern of ‘transcontinental global linkage’ and ‘intracontinental regional clustering’. The former is exemplified by collaborations between the USA and the UK, the USA and China, and the USA and Australia. The latter is exemplified by collaborations between the USA and Canada, as well as within Western European countries. Research in data governance and open sharing in the field of life sciences and medicine is thus primarily concentrated in economically developed countries. Many developing countries, facing difficult living conditions and health challenges, also need to improve medical and health levels through data governance and open sharing. However, conditions for data governance and sharing in these countries are often poor, owing to the lack of a universal data sharing platform or framework, the absence of guidelines or policies for data security and privacy protection, and limited awareness among researchers of the necessity of extensive data sharing. 30 It is recommended that developed countries strengthen their leading role in this field, actively cooperate with institutions and scientists in developing countries, raise their awareness of data governance and sharing, support developing countries in participating in cross-border data flows, provide capacity building and technical assistance, and include the perspectives of developing countries in the formulation of global data governance and sharing rules, in order to reduce the digital divide and health inequalities and to promote broader opening and sharing of data.
Analysis of the thematic evolution in the research area
The evolution of research themes in the field of life sciences and medical data governance and open sharing can be dissected from three perspectives. Firstly, examining the content of research within the life sciences and medical field, the focus of scholars has shifted from a disease-centric approach during 2000–2009 to a health-centric approach during 2010–2019. Since 2020, the emphasis has shifted toward epidemiology and public health issues. This reflects a paradigm change in the field from a ‘passive treatment’ to a ‘proactive health’ mindset, and from an ‘individual’ to a ‘population’ perspective. Governments, hospitals, and research institutions are urged to effectively manage and share resident health-related data, including physical examination data, chronic disease data, and nutritional data, to address scientific research demands centered on health management and promotion, as well as industrial needs for the development of health monitoring and management products. Concurrently, governments, hospitals, disease prevention and control agencies, and emergency management organizations should manage and share data on epidemics and public health, such as data related to the COVID-19 pandemic, to provide experiential references for potential future outbreaks and to deploy more proactive prevention and control measures.
From the perspective of research content in data governance and open sharing, the period from 2000 to 2007 marked a phase of prosperity for data management research. The years from 2008 to 2021 were characterized by a flourishing phase of data sharing research, and the period from 2022 to the present has seen the emergence of data governance. Data management involves the management of activities throughout the data lifecycle, while data governance encompasses the planning, decision-making, supervision, and control of data management. 31 The evolution from data management to data governance reflects the scholarly community's emphasis on data quality, security, and legal compliance, as well as its aspiration to create a sustainable data ecosystem. However, data management and data governance are closely interrelated and indispensable to each other. Although research focuses may vary across different periods, both must coexist in a balanced way in order to ensure data quality and security, maximize the value of data assets, and conduct data sharing and dissemination in a legal and compliant manner. Databases and data centers, in the process of storing, sharing, processing, and utilizing data, should not only manage the entire lifecycle of data at the micro-level but also adopt a more macro-perspective to facilitate the participation of diverse stakeholders in data governance. Additionally, they should integrate and utilize a variety of tools to enhance the quality, security, compliance, and ethicality of data.
In the realm of technological applications for data governance and open sharing, the year 2014 marked a significant turning point, as big data and artificial intelligence came prominently to the attention of scholars. Big data analytics can integrate diverse types of information, transforming vast amounts of data into actionable knowledge that aids in precision medicine, disease diagnosis, and risk warning. 32 Blockchain technology offers a potential decentralized distributed network for data sharing and governance, 33 yet it comes with inherent risks such as standards and interoperability issues, information privacy, and security concerns. Artificial intelligence can rapidly process large volumes of data and identify patterns and trends that may elude immediate human detection, thereby bringing additional possibilities to the governance and open sharing of data in the life sciences and medicine. Data sharing can also facilitate the collection of the extensive data needed to train powerful and highly predictive AI models. However, the unique requirements for privacy and security in this domain impose certain limitations on data access, which to some extent hinder the development of robust AI tools. 34 Researchers in the fields of life sciences and medicine should keep abreast of and educate themselves on big data and artificial intelligence technologies. They should apply these technologies judiciously in the processing and utilization of data, ensuring that large volumes of data can serve the health needs of residents and contribute to societal well-being. Concurrently, they must be vigilant against the leakage and misuse of personal health-related data, paying close attention to the protection of data security and the privacy of residents.
Analysis of hot spots and frontiers in the research area
Analysis of highly cited publications, high-frequency keywords, keyword co-occurrences and clustering reveals that the FAIR principles, ethical issues such as informed consent and privacy protection, public attitudes toward data sharing, data quality, databases, and big data analytics are hot topics in the field of data governance and open sharing within life sciences and medicine. The burst detection feature of CiteSpace can identify emerging research frontiers. 35 Combining the burst detection of cited publications and keywords with the evolution trends of themes, potential research frontiers in this field are identified to include: the FAIR principles, data quality, public trust and attitudes toward data sharing, the application of artificial intelligence technology in data sharing and governance, sharing and governance of epidemiological and public health data, sharing and governance of data on chronic diseases such as diabetes, and the construction of data governance models.
The FAIR principles have clarified the objectives of scientific data management and have gained widespread recognition from international stakeholders since their publication, marking a milestone in the development of scientific data guidelines. While the FAIR principles are not sufficient on their own for responsible data sharing, they are necessary. 36 Standardized data sharing in accordance with the FAIR principles forms the foundation for the application of new data-driven artificial intelligence analytical techniques. 37 Implementing the FAIR principles is crucial for enhancing data quality and maximizing the value of data. However, addressing ethical issues is a prerequisite and foundation for extensive data sharing. Internationally, significant attention is given to ethical concerns involved in data sharing, such as privacy protection and informed consent. Several open-access databases, such as the NCBI, UK Biobank, the Global Initiative on Sharing All Influenza Data (GISAID), Medical Information Mart for Intensive Care (MIMIC) and TCGA, adhere to specific legal and ethical standards. The NCBI, TCGA, and MIMIC, funded by the NIH in the USA, comply with the Health Insurance Portability and Accountability Act (HIPAA) regulations, which mandate the de-identification of Protected Health Information (PHI) to safeguard individual privacy. NCBI, UK Biobank, and GISAID ensure that all data processing activities are in accordance with the GDPR requirements, adhering to the principle of data minimization during data collection and emphasizing the protection of data subjects’ rights.
The public's attitude toward sharing their personal health data also significantly influences the process of data sharing. Data quality is another key factor affecting data sharing. Establishing an objective and systematic data quality management system is one of the core tasks of data governance. This system is essential to ensure the reliability of data, mitigate the risks associated with erroneous data, reduce the costs of data management, and enhance the utilization rate of data. 38 Big data and artificial intelligence technologies have demonstrated unique value in addressing some issues in data governance and open sharing. Some scholars propose methods such as federated learning and collaborative learning, which enable the collaborative training of machine learning models on distributed devices without disclosing sensitive data, thereby aiding in resolving data privacy and compliance issues. 39 Recently, model construction has become a research hotspot and frontier, with complex and diverse model-related studies. These include using data-driven models for anomaly prediction and maintenance of medical facilities 40 ; employing statistical predictive models to monitor and evaluate pediatric cancer data 41 ; and training AI models to help identify potential health risk factors and disease diagnostic targets, discover new drugs and vaccines, and develop personalized treatment plans. 39
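The federated learning idea mentioned above, in which sites train collaboratively without disclosing raw records, can be illustrated with a minimal sketch of federated averaging. This is an illustrative toy under stated assumptions, not the method of the cited work: the model is a one-parameter linear fit y ≈ w·x, the function names (`local_update`, `federated_average`, `train`) are hypothetical, the two "sites" hold made-up numbers rather than health data, and the sketch omits safeguards that real deployments typically add, such as secure aggregation or differential privacy.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one site's private (x, y) pairs.

    Only the updated weight leaves the site; the raw records never do.
    """
    for _ in range(epochs):
        # Gradient of the mean squared error of y ~ w * x with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(site_weights, site_sizes):
    """Combine local weights, weighting each site by its number of records."""
    total = sum(site_sizes)
    return sum(w * n for w, n in zip(site_weights, site_sizes)) / total


def train(sites, rounds=50, w0=0.0):
    """Alternate local training and server-side averaging for several rounds."""
    w = w0
    for _ in range(rounds):
        local_weights = [local_update(w, data) for data in sites]
        w = federated_average(local_weights, [len(data) for data in sites])
    return w


# Two hypothetical sites with made-up data following y = 3 * x.
site_a = [(1.0, 3.0), (2.0, 6.0)]
site_b = [(3.0, 9.0), (4.0, 12.0), (5.0, 15.0)]

global_weight = train([site_a, site_b])
```

In this toy setting the shared weight converges toward the slope underlying both sites' data, even though neither site reveals its records to the other or to the coordinating server. Whether exchanged model parameters themselves leak information is a separate question that the sketch does not address.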
The hot topics and frontiers in the research domain provide scholars with directions for future studies. Greater focus should be placed on data quality issues, ethical considerations in data usage, and the application of big data and artificial intelligence technologies. Concurrently, the governance and sharing of data in critical areas such as public health and chronic diseases should be strengthened. Ethical issues surrounding data encompass informed consent, privacy protection, and public trust. The application of big data and artificial intelligence technologies should particularly concentrate on the utilization of large-scale data models in disease detection, diagnosis, and clinical treatment.
Conclusion
Between 2000 and 2024, the number of studies on data governance and open sharing in the fields of life sciences and medicine showed an overall increase, with a slight decline after the 2021 peak, indicating the growing importance of research in this area. The USA leads as the country with the most research output in this field. The University of Oxford is the institution with the highest publication volume, Amy L. McGuire is the most active author, and the
Footnotes
Acknowledgements
The authors thank the developers of CiteSpace and VOSviewer for making these tools freely available to researchers.
Contributorship
Zhimin Hu contributed to conceptualization, funding acquisition, project administration, supervision and writing–review and editing. Yanrui Qiu contributed to conceptualization, data curation, methodology, software, visualization and writing–original draft. All authors contributed to the article and reviewed the submitted version.
Data availability
The data in this study are not sensitive and are accessible in the public domain. All the data used in the study have been included in the article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Noncommunicable Chronic Diseases-National Science and Technology Major Project (2023ZD0509701), and Medical and Health Technology Innovation Project of Chinese Academy of Medical Sciences (2021-I2M-1-057).
Guarantor
The guarantor of the study, Zhimin Hu, had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
