Abstract
Objectives
Large language models (LLMs) are revolutionizing medical research. However, bibliometric analyses identifying the citation trends shaping the history of this field are lacking. This study analyzes the top 100 (T100) most-cited articles on LLMs in medicine to assess their impact and characteristics.
Methods
We conducted a bibliometric analysis of top-cited articles in the Web of Science database, using search terms such as “LLMs, generative artificial intelligence, GPT” and covering 2022 to 2025. Two reviewers identified the T100 papers, extracting publication details, citations, and research themes, adhering to BIBLIO reporting guidelines.
Results
The T100 articles were contributed by 655 authors, and 92 articles were published in 2023. Original research constituted the majority of publications (60 articles). Collectively, these works accumulated 14,847 citations, with individual citations ranging from 50 to 1057 (average 148.47). The U.S. led global contributions with 56 articles, with Stanford University emerging as the most prolific institution (8 articles). The articles appeared in 70 peer-reviewed journals; the top seven journals accounted for 31% of the T100, and the Journal of Medical Internet Research published the largest share (8 articles). The most-cited article is “Evolutionary-scale prediction of atomic-level protein structure with a language model” (Lin et al., Science 2023; 1057 citations). The research themes centered on evaluating LLMs’ performance in exam-style evaluations, medical knowledge synthesis, and question-answering tasks in medicine.
Conclusion
This analysis provides a core overview of high-impact LLM research in medicine, guiding future applications. The findings highlight remarkable progress in clinical decision support, drug discovery, multimodal medical imaging analysis, and personalized medical question-answering. They also stress the need for prospective trials to assess real-world clinical impacts, boost the reliability of LLM-generated medical information, develop consensus-driven solutions to address ethical challenges, and launch global initiatives to democratize LLM tools.
Introduction
The advent of large language models (LLMs) has precipitated a transformative shift in artificial intelligence (AI) applications across scientific disciplines. These models undergo pretraining on extensive corpora of textual data, endowing them with the capability to both produce and comprehend natural language text. Since the release of ChatGPT in November 2022, LLMs have demonstrated unprecedented capabilities in medical knowledge synthesis, clinical decision support, and patient–physician communication optimization. 1 Their integration into healthcare systems has sparked extensive scholarly discourse, evidenced by an exponential growth in publications indexed in databases. 2 Subsequently, in March 2023, GPT-4 was introduced, offering enhanced language comprehension and generation capabilities. In May 2024, GPT-4o was released, supporting inputs of text, audio, and images, promoting more natural human–machine interaction. Moreover, domain-specific generative transformer language models have been burgeoning.3,4 However, the rapid proliferation of research outputs necessitates systematic evaluations to identify knowledge clusters, intellectual trajectories, and evidence gaps within this field.
Citation analysis serves as a cornerstone of bibliometric research, providing quantitative insights into academic impact and disciplinary evolution. 5 Prior studies have utilized this methodology to map landmark contributions in ChatGPT-related medical domains.6,7 However, there has been no thorough bibliometric analysis of impactful LLM research in medicine, a significant gap considering these models’ ethical issues and potential for practical application. This gap hinders informed resource allocation in research and obscures our grasp of LLM integration into medical knowledge. This study aims to conduct the first bibliometric analysis of the top 100 (T100) most-cited articles on LLM applications in medicine.
Methods
Data collection
To ensure reproducibility, this study follows the Preliminary guideline for reporting bibliometric reviews of the biomedical literature (BIBLIO) 8 (Checklist in the Supplementary). The Science Citation Index Expanded of Web of Science (WOS) was queried using a targeted search term to capture the full spectrum of LLM-related medical research: TS = (“ChatGPT” OR “GPT” OR “Generative pre-trained transformer” OR “Generative artificial intelligence” OR “Generative AI” OR “Gemini” OR “Bard” OR “Claude” OR “Copilot” OR “LLAMA” OR “Deep-seek” OR “Chatbot*” OR “large language model*” OR “LLM”). The selection of these terms was based on a careful consideration of the current landscape of LLMs in medicine.9,10 GPT, Generative AI (GAI), and their variants represent the core architecture of most LLMs in medicine; Gemini, Bard, Claude, and others are major commercial systems with medical applications. Gemini demonstrates remarkable capabilities in multimodal reasoning, while Bard and Deep-seek show potential in medical text generation tasks. Claude and Copilot, as sophisticated LLMs, excel at handling complex medical inquiries and providing detailed explanations. Chatbots are simpler, patient-facing tools designed to provide straightforward healthcare advice and support.
The timeframe for literature selection was set from 1 November 2022 to 11 February 2025. This period was chosen based on two critical considerations. (a) The release of ChatGPT in November 2022 marked a significant turning point in the application of medical LLMs; this period captures the complete innovation cycle from initial prototype deployment to clinical validation phases. (b) Preliminary analysis revealed that 92% of LLM-related medical research outputs have occurred post-2022. Restricting the search to recent publications ensures that the rapidly evolving literature is captured.
Data preprocessing
Articles identified in this original search were then manually reviewed (ZQL and XRG) and filtered against the following criteria: (a) the publication focuses on LLM applications in medicine; (b) the document type is an original article or review; (c) the T100 most-cited articles were selected purely on total citation count. The literature search and screening process are shown in the flowchart (Figure 1). The H5-index of journals was primarily obtained through Google Scholar (https://scholar.google.com), while the H-index and G-index of authors were sourced from the WOS. The H5-index reflects the citation impact of a journal's articles over the past 5 years, offering insight into the journal's recent influence, while the H-index and G-index of authors measure their productivity and the impact of their work over their entire career. Two reviewers evaluated each article, with a third reviewer resolving any discrepancies.
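For readers unfamiliar with these author-level metrics, the definitions above can be made concrete in a short sketch (an illustration of the standard definitions only, not part of the study's WOS/SPSS workflow): the H-index is the largest h such that h papers each have at least h citations, and the G-index is the largest g such that the g most-cited papers together have at least g² citations.

```python
def h_index(citations):
    """Largest h such that h papers each have >= h citations."""
    cites = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(cites, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers together have >= g**2 citations."""
    cites = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(cites, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g


# A hypothetical author with papers cited 10, 8, 5, 4, and 3 times:
# h_index -> 4 (four papers with at least 4 citations each)
# g_index -> 5 (30 total citations >= 5**2 = 25)
```

The G-index rewards a few very highly cited papers more than the H-index does, which is why the two are reported together.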

The flowchart of the top 100 most-cited articles on large language models.
Statistical analysis
Statistical analysis was performed on the extracted data, which included article type, title, publication year, citation count, authors, journal, and research focus areas. The data were analyzed and visualized using CiteSpace (V6.3.R1, Drexel University, PA, USA), the bibliometrix and ggplot2 package in R software version 4.4.2, and Microsoft Excel (Microsoft Corp, WA, USA).
For the statistical analysis, bivariate correlation analyses were conducted using appropriate coefficients based on the types of variables. Specifically, Spearman's rank-order correlation coefficient was used to assess the associations between continuous variables that did not meet normality assumptions or with ordinal variables. Eta-squared coefficient was applied to measure the effect size between nominal categorical variables and continuous variables. The Chi-square test of independence was used to evaluate the associations between nominal categorical variables. A correlation heatmap was generated to illustrate correlations among various variables and citation counts. These visualizations facilitated the identification of patterns and connections between elements such as the impact factors (IFs), research type, and citation. All statistical analyses were carried out using IBM SPSS Statistics (version 26.0), and statistical significance was determined using a two-tailed α of 0.05.
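The two key coefficients above can be illustrated without specialized software. The following pure-Python sketch (an illustration of the definitions, not the study's SPSS pipeline; it assumes tie-free data for Spearman's shortcut formula) implements Spearman's rank-order correlation and the eta-squared effect size as the ratio of between-group to total sum of squares.

```python
def spearman_rho(x, y):
    """Spearman's rank-order correlation via the no-ties shortcut:
    rho = 1 - 6 * sum(d_i**2) / (n * (n**2 - 1)),
    where d_i is the difference between the ranks of x_i and y_i."""
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        r = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def eta_squared(groups):
    """Effect size for a nominal grouping of a continuous variable:
    between-group sum of squares / total sum of squares."""
    pooled = [v for g in groups for v in g]
    grand = sum(pooled) / len(pooled)
    ss_total = sum((v - grand) ** 2 for v in pooled)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total if ss_total else 0.0
```

A perfectly monotone pairing yields rho = 1.0, and groups whose means fully explain the variance yield eta-squared = 1.0.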
Results
Publication outputs and basic characteristics
All T100 cited articles were English publications, averaging 148.47 citations (range: 50–1057). Most (92 articles) originated in 2023, reflecting the explosive growth of research following ChatGPT's release. Table 1 shows the top 10 most-cited papers; half were review articles, while the remaining original studies focused on “protein structure prediction,” “protein sequence generation,” “Chatbot performance,” “Healthcare,” and “clinical knowledge encoding.” The most-cited article, “Evolutionary-scale prediction of atomic-level protein structure with a language model” (Lin et al., Science 2023; 1057 citations), demonstrates the application of LLMs to directly infer atomic-level protein structures from primary sequences, significantly accelerating high-resolution structure prediction and enabling the construction of the ESM Metagenomic Atlas. 11 This provides a powerful tool for exploring the vast diversity and functionality of natural proteins. Notably, among the top 50 highly cited papers, the three published in 2024 had less time to accumulate citations than those from 2023. In 2025, however, these 2024 papers showed a rapid citation increase, significantly surpassing the citation counts of papers at the same ranking positions from 2023. The 2024 papers focused on ChatGPT performance on simplified radiology reports, scGPT for single-cell multi-omics, and a taxonomy and systematic review of ChatGPT in healthcare.4,12,13 Similarly, in the bottom 50 ranked papers, those published in 2024 demonstrated a stronger citation acceleration than their 2023 counterparts (eTable 2 in the Supplementary).
Characteristics of the top 10 most-cited articles on large language models in medicine.
Note: NEJM, New England Journal of Medicine. All publications in this table were published in 2023. The data extraction of citations was completed on 2 Nov 2025.
Full table is shown in eTable 2 in Supplementary.
AI: artificial intelligence.
Across the entire dataset, original research predominated (60%), followed by reviews (39%) and editorials (1%) (eTable 2 in the Supplementary). Within the original research category, all studies employed an observational cross-sectional design to assess ChatGPT performance, compare LLMs, or develop specific LLM models. More precisely, these can be categorized as diagnostic test accuracy studies (n = 51), case studies (n = 5), qualitative research (n = 3), and cross-sectional questionnaire surveys (n = 1). This distribution highlights the field's emphasis on empirical validation, reflecting a dynamic and proactive approach to exploring and confirming the practical utility of AI in healthcare environments. This emphasis on empirical validation is crucial for enhancing the reliability and efficacy of AI applications in medical practice.
Country, institution, and author analysis
The United States (U.S.) led global contributions with 56 articles (56%), followed by the UK (8%), Canada (7%), China (7%), Australia (6%), and Germany (6%) (Figure 2(A)). Notably, the U.S., UK, Singapore, Italy, and China exhibited extensive international collaborative networks (Figure 3(A)). The disproportionate geographic distribution of research highlights systemic inequities in LLM development, which may reflect the fact that advanced economies with robust computing infrastructure and access to high-performance LLMs are better positioned to pioneer medical AI research. In addition, countries like the U.S. have established agile regulatory frameworks for AI validation in healthcare, enabling faster clinical adoption. A total of 309 institutions contributed to these studies, with Stanford University emerging as the most prolific institution (n = 8), followed by Vanderbilt University, Harvard University, the University of California System, and the University of Toronto (each with 6 articles; Figure 2(B)). The highest number of citations, on the other hand, goes to New York University, largely because it published four highly cited papers, one of which ranked highest in citations. 11 Nine institutions produced five or more articles, and collaboration networks centered on elite U.S. academic hubs (e.g. Harvard, Stanford) further emphasize the role of resource-rich ecosystems in driving high-impact research (Figure 3(B)). Regional institutions such as Sichuan University (China), the University of Toronto (Canada), and the National University of Singapore also demonstrated robust collaborative engagement.

(A) Number of studies per country/region of large language models in medicine. (B) Number of studies per institution of large language models in medicine.

Scientific publications of large language models in medicine. (A) The co-occurrence network map of different countries. (B) The co-occurrence network map of institutions. (C) Number of publications in the top 10 areas of research. (D) The clustering map of keywords.
The absence of a dominant authorship network suggests decentralized innovation patterns in medical LLM research; the top three authors, Mesko, Bertalan; Klang, Eyal; and Liu, Siru, contributed three articles each. Notable entries include Nigam H. Shah (H-index 60, G-index 14, Stanford Univ, Center for Biomedical Informatics Research), Stacy Joeb (H-index 61, G-index 12, New York Univ, Med Sch, Department of Psychiatry), and Yeo, Yee Hu (H-index 69, G-index 14, Cedars Sinai Medical Center, Karsh Div Gastroenterol & Hepatol).
Journal analysis
Eleven journals published two or more medical LLM research articles (Figure 4(A)), and 31% of the T100 articles were concentrated in the top 7 peer-reviewed journals. The Journal of Medical Internet Research published the largest share (8 articles; IF: 5.8, H5 = 160), followed by Radiology (6 articles; IF: 12.1, H5 = 140) and npj Digital Medicine (5 articles; IF: 12.4, H5 = 109), while high-impact journals like NEJM (IF: 96.2) and Lancet (IF: 168.9) contributed fewer but influential studies.14,15 Of the T100 papers, 75% appeared in Q1 journals.

(A) Journal contribution rankings by publication volume and impact metrics. (B) Word cloud map of keyword co-occurrence analysis.
Keywords and keyword clustering
The most frequent keywords were “AI (n = 46),” “LLMs (n = 25),” and “natural language processing (n = 7)” (Figure 4(B)). The keyword cloud visualization highlights research priorities: clinical decision support, patient communication, and multimodal data integration dominate. Technical terms (transformer architectures, natural language processing) intertwine with ethical concerns (medical bias, ethical challenges).
Keyword clustering, which identifies shared themes or concepts, effectively organizes keywords, outlining the main research directions and key points of interest in the literature. This study identified 10 major research clusters (modularity Q > 0.8; silhouette score > 0.9), with different colors representing different thematic clusters, confirming robust thematic cohesion and reliability of the clustering results. The largest clusters, labeled #0 LLMs, #2 Conversational Agent, #4 GAI, and #5 AI, pertain to LLMs and are predominantly connected to subjects such as AI development, algorithms, machine learning, learning models, and ChatGPT. Other notable clusters included: #1, which centers on “Patient Privacy” and concerns related to patient information; #3, focusing on “Clinical Decision Support,” with ties to clinical decision-making and best practices; #6 Equity, related to terms associated with low- and middle-income countries (LMICs), global health, and public health; #7 Medical Education, mainly related to education and examination, continuing medical education, and reshaping medical education; #8 Oncology, with a focus on topics like lung cancer and inquiries regarding cancer; and #9 Extraintestinal Manifestations, prominently linked with terms such as inflammatory bowel disease, period, biomarkers, and diagnosis (Figure 3(D)).
Research disciplines and topics
These publications span 38 disciplines, with Health Care Sciences Services being the leading research direction, as evidenced by 23 articles. Medical Informatics and Medicine, General & Internal also emerge as key research directions (Figure 3(C)). Title and content analysis of the topics in the original studies revealed that 58 studies evaluated LLMs’ performance in generating or answering medical information questions. Common clinical foci included radiology (n = 10), oncology (n = 7), and medical licensing examinations (n = 7) (Table 2). For instance, numerous studies have assessed the quality of LLMs in answering clinical knowledge questions,16,17 which offers patients valuable information, especially when they are reluctant to consult healthcare professionals or when access to medical advice is restricted. In controlled settings, the quality and even empathy of Chatbot responses were significantly higher than those of physician responses. However, LLMs differed significantly in the clinical treatment advice they provided. 18 Furthermore, we also analyzed papers with an IF above 10 and those from journals with three or more publications (eTable 1 in the Supplementary). We found that highly cited review articles mainly focus on clinical medicine, healthcare, and medical images/radiology. Original research papers, on the other hand, often assess the performance of LLMs in knowledge acquisition or question-answering tasks across specific medical fields.
Research topics of 100 top-cited papers.
Abbreviation: LLM: large language model.
Citation distribution
Figure 5(A) presents the citation distribution of the T100 cited medical LLM papers published from 2022 to 2024, grouped by year. It shows that citation counts are highest and most variable for 2023 papers. Monthly citation trends for these papers show a significant surge in early 2023, particularly in January and February (Figure 5(B)). Citations then gradually decline from mid-2023 to 2024, indicating fluctuating research interest and impact over time. Figure 5(C) illustrates the citation distribution across JCR quartiles (Q1 to Q4), showing higher and more variable citations in Q1 and Q2. Figure 5(D) presents the annual citation trends, revealing a significant increase in citations for Q1 papers in 2023, along with notable citations for Q2 papers. Most papers in Q3–Q4 have relatively low and stable citation counts across the 3 years.

Boxplot of total citations for large language models in medicine (published in 2022–2024). (A) Total citations of top 100 cited papers by year; (B) monthly total citations of top 100 cited papers; (C) total citations of top 100 cited papers by JCR quartile; (D) total citations of top 100 cited papers by year and JCR quartile.
Analysis of correlation
The correlation heatmap demonstrates significant correlations among bibliometric indicators in medical LLM research. Significant associations appear between consecutive annual citations (2023–2025) and total citations, the strongest being between 2024 citations and total citations (r = 0.949). The IF shows a moderate correlation with total citations (r = 0.364), and over time, the association between the IF and annual citation counts becomes increasingly strong. However, there is no statistically significant correlation between the JCR quartile or research type and citation counts (Figure 6).

Correlation heatmap in large language models in medicine.
Discussion
This study characterizes and analyzes the T100 most-cited publications, which may shape the history of LLMs in medicine, showing the field's rapid evolution. These findings hold significant implications for researchers and policymakers engaged in integrating LLMs into healthcare. The 14,837 citations accumulated within 28 months post-publication demonstrate unprecedented citation velocity compared to other AI/medical subfields. 7 The concentration of 92% of top-cited articles in 2023 reflects the pivotal role of ChatGPT's public release (November 2022) in catalyzing a surge of LLM research in medicine, which in turn drove rapid innovation in clinical decision support, drug discovery, and multimodal medical data analysis. This pattern aligns with the typical 6–12 month latency between technological breakthroughs and academic publication cycles. This temporal distribution also aligns with Gartner's AI maturity curve, where 2023 represents the “Peak of Inflated Expectations” phase for medical LLMs. 19 A key driver of this growth is GAI milestones: the launch of GPT-4 (March 2023) demonstrated LLMs’ capacity to achieve near-human performance on medical licensing exams, triggering widespread validation studies across clinical tasks. 20
Earlier studies typically accumulate more citations simply because they have had more time to be noticed and cited, so newer articles naturally have relatively lower citation counts. To provide a more balanced view and compare papers from different publication years fairly, it is useful to rank papers by their annual citation counts and track citation changes on a yearly basis. This approach helps researchers better understand citation patterns over time. When comparing papers published in 2023 and 2024, we found that many 2024 papers were published in high-quality journals and focused on pioneering topics, which may have accelerated their citation rates12,13 and ultimately shaped the current pattern of the T100 citation rankings. Although the correlations between 2025 citations and total citations are lower than those for 2024, this is mainly because our analysis only included citation data up to 11 February 2025. Hence, the half-lives of papers on LLMs in medicine still need to be tracked and analyzed over time.
This study reveals two transformative insights. First, accelerated knowledge production with persistent gaps: while LLM research has achieved unprecedented citation velocity (14,837 citations in 28 months), its focus remains narrowly technical. Although LLMs have demonstrated high accuracy in areas such as medical licensing exams and clinical expertise, less research addresses real-world challenges such as clinician workflow integration or patient consent protocols, highlighting a critical disconnect between technical innovation and clinical needs. Second, geographic and institutional imbalance: the concentration of high-impact research in U.S. institutions (56%) and elite universities (Stanford, Harvard) exposes systemic inequities in AI development. While elite institutions drive innovation through resource-rich ecosystems, LMICs remain underrepresented, risking the perpetuation of healthcare disparities. 14 To address these issues, several concrete proposals could encourage equitable contributions from LMICs. Firstly, international research funding bodies could establish dedicated grants and collaboration programs that specifically target researchers in LMICs, providing the financial resources and infrastructure needed to conduct AI research. This would help level the playing field and enable researchers from diverse geographical backgrounds to contribute to the field. Secondly, piloting partnerships, including joint research projects, academic exchanges, and mentorship programs, could facilitate knowledge exchange and capacity building, giving researchers in LMICs access to expertise and resources that might otherwise be unavailable. Thirdly, establishing computing resource-sharing and federated data networks would empower LMICs’ participation in model training.
Lastly, investing in educational and training programs in LMICs focused on AI literacy and research methodologies would be crucial. By strengthening local capabilities in AI development, these countries can become more active participants in the global research landscape, rather than merely consumers of AI technologies developed elsewhere. This investment in human capital would have long-term benefits for the entire field of AI research.
We advocate for a judicious and ethical use of LLMs, ensuring that their integration into medical practice is both scientifically sound and beneficial to patient well-being. This analysis provides three actionable priorities for healthcare stakeholders. First, while research has evolved from foundational LLM architectures (e.g. GPT series, BERT) to domain-specific medical adaptations, there is a striking contrast between the focus on medical licensing exams or medical question-answering performance and the limited exploration of clinical workflow integration and patient-centered outcomes. 21 Despite the confirmed potential of LLMs to reduce diagnostic errors and enhance patient communication, more rigorous trials are urgently needed to evaluate LLMs’ real-world impact on diagnostic accuracy, clinical workflow efficiency, and clinician burnout mitigation.22,23 Second, keyword clustering and analysis of highly cited works reveal a significant emphasis on technical benchmarks, such as diagnostic accuracy and radiology report generation. However, there is limited attention to ethical and equity concerns, highlighting a concerning gap between technical advancements and real-world clinical integration. Although clusters such as #1 Patient Privacy and #6 Equity were identified, few studies proposed actionable frameworks to mitigate biases or ensure algorithmic transparency. Given the increasing deployment of LLMs in sensitive applications, emerging challenges, including hallucination, patient data privacy and security, academic integrity, and liability determination, demand consensus-driven solutions and robust ethical and regulatory frameworks. 24 As LLMs rapidly evolve, standardized reporting guidelines are also emerging. For instance, the GAMER Statement provides a standardized guideline for LLM use in medical research, covering tool specifications and roles, clarifying impacts on findings, and ensuring the transparency, integrity, and quality of research.
Similarly, the TRIPOD-LLM Statement enhances the quality, reproducibility, and clinical applicability of LLM research in healthcare; its checklist integrates these concepts throughout, ensuring that bias and fairness are considered at every stage of the model's life cycle.25,26 Furthermore, a three-stage framework called HELP-ME has been designed to evaluate and protect privacy in healthcare-oriented LLMs. It includes ethical privacy threat assessment, prompt-focused evaluation, and ethical obfuscation to protect patient data while preserving model utility. The framework's effectiveness has been validated, highlighting its role in upholding ethical standards in clinical practice. 27 However, LLMs used in clinical decision-making still lack consensus on liability for incorrect recommendations. With no clear legal framework, it is crucial to clarify responsibility as these models are integrated into medical decision-making. Third, disparities in early access to foundational LLM technologies risk exacerbating global healthcare inequities. Efforts to democratize LLM tools, such as open-source initiatives exemplified by models like Deep-seek, represent critical steps toward ensuring these innovations address region-specific healthcare challenges. However, the success of such initiatives hinges on sustained funding and support for research led by LMICs. 14 This support is essential to ensure that LLM tools not only advance technological capabilities but also reduce global disparities in healthcare access and quality. By fostering equitable access to these technologies, we can promote the development of contextually appropriate solutions that meet the unique needs of diverse populations, thereby advancing global health equity.
Limitations
The interpretation of our findings is subject to four limitations. First, despite employing a search strategy that combined broad and specific terms, some significant studies may have been inadvertently excluded due to inconsistent terminology. However, the supplementary materials help ensure transparency and reproducibility. Although limiting the search window to November 2022–February 2025 may overlook earlier research findings, a supplementary search revealed only one predictive model study in the biomedical field that aligns with LLMs, 28 and it has no impact on the trend analysis of this study. Second, the inherent recency bias of citation-based metrics may underestimate pioneering contributions from emerging research ecosystems, particularly from low-resource settings where dissemination delays and limited international visibility disproportionately affect citation accrual. Longitudinal tracking beyond the 2022–2025 window is essential to validate the sustained impact of such studies. Third, CiteSpace's minimum temporal slicing unit of 1 year impedes granular analysis of knowledge trajectory inflection points in rapidly evolving domains such as LLMs. For example, quarterly fluctuations, such as the Q2 2023 surge in LLM-driven radiology report optimization studies following GPT-4's release,29–31 remain undetectable, potentially obscuring critical shifts in research priorities. While citation acceleration metrics offer dynamic insights, and a citations-per-month measure might enable fairer comparison, citation data are often recorded only annually, and obtaining monthly data would require more granular tracking that is not currently feasible given limitations in data collection and reporting practices. Future studies should continue monitoring this field, with the expectation that, as time progresses, trends in LLMs within this domain will be more effectively captured.
Fourth, AI-related breakthrough studies often first appear on arXiv before formal WOS indexing. Future investigations should adopt a hybrid framework, initially identifying core literature via WOS and then supplementing it with arXiv preprints (time-lag adjusted) and ResearchGate attention scores. This dual-layer approach balances rigor with trend sensitivity, particularly for fast-evolving domains such as LLMs.
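The citations-per-month normalization mentioned among the limitations is simple to state precisely. The sketch below (the function name, dates, and citation counts are illustrative assumptions, not values from the study) divides a paper's total citations by the number of whole months elapsed between its publication month and a fixed census date, such as this study's 11 February 2025 cutoff.

```python
from datetime import date


def citations_per_month(total_citations, published, as_of=date(2025, 2, 11)):
    """Age-normalized citation rate: total citations divided by the number
    of whole months elapsed between publication and the census date.
    Papers less than one month old are treated as one month old."""
    months = (as_of.year - published.year) * 12 + (as_of.month - published.month)
    return total_citations / max(months, 1)


# A hypothetical paper with 600 citations, published March 2023, has been
# citable for 23 months as of the census date, giving roughly 26 cites/month.
rate = citations_per_month(600, date(2023, 3, 1))
```

Ranking by such a rate rather than by raw totals reduces the head start that 2023 papers enjoy over 2024 ones, at the cost of being noisy for very recent publications.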
Conclusion
The study reveals the trends, hotspots, and critical gaps in the T100 most-cited LLM studies in medicine. Current research focuses on technical validation, such as diagnostic accuracy and protein structure prediction, but underexplored areas such as real-world surgical assistance systems and rare disease diagnostics need urgent attention. To bridge the current “bench-to-bedside” translation gap, three priorities emerge (Figure 7). First, we need to standardize evaluation protocols across clinical specialties, focusing on real-world impact metrics beyond accuracy alone. Second, the dominance of U.S. institutions shows systemic inequities; we need global resource-sharing platforms to democratize LLM development, especially for low-resource regions. Third, we must foster interdisciplinary collaborations to turn technical advances into clinically useful tools. Future research must balance innovation with ethical considerations, ensuring that LLMs enhance both medical knowledge and healthcare equity. By aligning citation trends with unmet clinical needs, the field can shift from exam-centric benchmarks to ethically sound, patient-centered AI solutions.

The urgent gaps of large language models in medicine that need action.
Supplemental Material
sj-docx-1-dhj-10.1177_20552076251365059 - Supplemental material for The top 100 most-cited articles on large language models in medicine: A bibliometric analysis
Supplemental material, sj-docx-1-dhj-10.1177_20552076251365059 for The top 100 most-cited articles on large language models in medicine: A bibliometric analysis by Zhi-Qiang Li, Runbing Xu, Xin-Ran Gong, Cheng-Lu Wang and Jian-Ping Liu in DIGITAL HEALTH
Supplemental Material
sj-docx-2-dhj-10.1177_20552076251365059 - Supplemental material for The top 100 most-cited articles on large language models in medicine: A bibliometric analysis
Supplemental material, sj-docx-2-dhj-10.1177_20552076251365059 for The top 100 most-cited articles on large language models in medicine: A bibliometric analysis by Zhi-Qiang Li, Runbing Xu, Xin-Ran Gong, Cheng-Lu Wang and Jian-Ping Liu in DIGITAL HEALTH
Footnotes
Acknowledgements
We thank all of the global researchers who have contributed to the healthcare research field of LLMs. During the revision of the manuscript, the authors used Deep-seek and Kimi to correct typographical and grammatical errors. No generative language models were employed in the ideation or writing process. After using these tools, the authors reviewed and edited the content as necessary and take full responsibility for the content presented.
Ethical approval
There is no need for ethics committee approval or consent to participate, as all data used in this bibliometric analysis were sourced from the WOS and did not involve data from human or animal subjects.
Author contributions
ZQL, RBX, XRG, CLW, and JPL wrote and revised the text. All authors gave final approval of the manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the High-level Traditional Chinese Medicine Key Subject Construction Project of the National Administration of Traditional Chinese Medicine (Evidence-based Traditional Chinese Medicine; grant number 90010951310169).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability
Supplemental material
Supplemental material for this article is available online.
References
Supplementary Material
