Sage Journals: Discover world-class research

Abstract

Objectives

This study evaluates artificial intelligence (AI) reasoning capabilities in gynecologic cancer genetic counseling, comparing the performance of ChatGPT and DeepSeek models to guide patient-centered AI implementation in clinical genetics.

Methods

Using 40 National Comprehensive Cancer Network-aligned counseling scenarios, we conducted blinded dual-oncologist evaluations of two large language models. Methodological rigor included model anonymization, a pre-calibrated scoring framework, and validated metrics (Global Quality Scale and Patient Education Materials Assessment Tool) assessing informational coherence, understandability, and actionability.

Results

DeepSeek demonstrated superior informational breadth (mean character difference: −609.0, p < .0001) and visual communication (diagram integration, p < .01), with 49-fold greater probability in recommending clear and actionable actions (p < .01, OR = 49.0). ChatGPT excelled in concise summarization (22% faster response generation, p = .013).

Conclusion

Strategic AI model selection—leveraging DeepSeek's visually-rich, structured educational approach for complex information, and ChatGPT's concise, rapid summarization for efficient communication—enhances patient-centered genetic education when combined with clinician oversight. This framework supports healthcare's digital transformation by optimizing human-AI collaboration in hereditary cancer care.

Keywords

Artificial intelligence genetic counseling gynecologic cancer patient-centered care digital health transformation

Introduction

As AI technology moves rapidly into oncology, we will need to find the right mixture between medical service and algorithmic explainability, especially when it comes to challenging fields like genetic counseling. AI chatbots offer unprecedented opportunities to convey risk for breast cancer.¹ However, before employing them in patient counseling in gynecologic cancer, which implicates complex hereditary risk assessments in decision making, we must understand the meaning of and confidence in their reasons. The contemporary focus of AI research in oncology is diagnostic accuracy, but not comprehensibility. The relevance of such divergence becomes particularly relevant in genetic counseling, as the success of genetic counseling lies on the balance among statistical accuracy and educational flexibility - properties not accurately summarized by standard measures.² Here, we hope to shed some light toward the issue by proposing two distinct evaluation methods to compare ChatGPT and DeepSeek on the gynecologic cancer counseling application. In the process, we focused on the unique information rhythms exhibited by each model and explored whether this affects their applicability in different counseling contexts.

Method

We leveraged Patel et al.'s approach of translating genetic counseling guidelines for gynecologic cancers to accessible content that was accompanied with 40 guideline-matched questions and two oncologists’ blind assessments.³ According to the article, these scenarios were developed by a team of gynecologic oncology specialists with reference to authoritative professional association websites, including those of the American College of Obstetricians and Gynecologists, Cancer Care, and the National Cancer Institute. The scenarios are derived from common counseling situations encountered in clinical practice. The 40 guideline-matched questions were reviewed and validated by an expert panel, all of whom are board-certified gynecologic oncologists affiliated with academic institutions. The 40 scenarios are divided into two categories: general genetic counseling questions (n = 20) and syndrome-specific questions, including those related to Lynch syndrome (n = 3) and hereditary breast and ovarian cancer syndrome (n = 17). Lynch syndrome is associated with endometrial cancer, while hereditary breast and ovarian cancer syndrome is linked to ovarian cancer. Table S1 in the Supplementary materials shows the results of ChatGPT and DeepSeek's responses to two of these questions

To compare, we used the following model versions: ChatGPT: gpt-3.5-turbo-0125 (April 2, 2025) and DeepSeek: DeepSeek-R1 version (April 2, 2025). Both models were accessed via their respective web interfaces on the same Windows computer. No additional prompt engineering or special configurations (such as custom temperature settings) were applied during the comparisons.

Outcomes were assessed using the Global Quality Scale (GQS, 1–5 Likert) for informational coherence and a modified PEMAT tool evaluating PEMAT-understandability (PEMAT-U) and PEMAT-actionability (PEMAT-A).^4,5 Each PEMAT item was assigned a numerical value: “yes” = 1, “no” = 0, and “N/A” was treated as missing and excluded from calculations. PEMAT results are typically reported as percentage scores reflecting understandability and actionability, calculated as follows:

Understandability Dimension:

Total Possible Score = Total number of understandability items − number of items marked “N/A”

Actual Score = Sum of all items marked “Yes” (i.e. 1)

Understandability Score = (Actual Score / Total Possible Score) × 100

Actionability Dimension:

Total Possible Score = Total number of actionability items − number of items marked “N/A”

Actual Score = Sum of all items marked “Yes” (i.e. 1)

Actionability Score = (Actual Score / Total Possible Score) × 100

Based on the scoring results described above, the PEMATA/U percentage scores were categorized into three levels: high: 70%–100%; moderate: 40%–69%; low: 0%–39%.

We employed two methods to ensure quality control: (1) random encoding to model anonymization to prevent rater prejudice; and (2) scoring was initially performed independently by two experienced gynecologic oncologists who received specific training in the application of the PEMAT and GQS scoring criteria prior to the formal evaluation. In cases where the difference between their ratings was differences of two points or more for any item, the discrepancies were resolved by referring to the relevant National Comprehensive Cancer Network (NCCN) Guidelines. A third expert was available for arbitration if consensus could not be reached through guideline review, though this was not required in the current study.⁶ High inter-rater reliability (Table S2 in the supplementary materials) was observed. This dual-metric approach objectively quantified AI performance in clinical knowledge dissemination while maintaining methodological parity with prior surgical oncology AI studies. The reporting of this study conforms to STROBE guidelines (STARD-2015).

Results

The analysis revealed marked differences in content generation between ChatGPT and DeepSeek (Figure 1). DeepSeek produced significantly longer responses, with a mean character count difference of −609.0 (95% CI: −734.3 to −483.7; p < .0001) and a large effect size (η² = 0.713), indicating superior informational breadth. While GQS scores trended toward significance (p = .0613), DeepSeek demonstrated clinically meaningful advantages in patient education metrics. It outperformed ChatGPT in both PEMAT-U (median difference: −0.077; p < .0001) and PEMAT-A (median difference: −0.167; p < .0001), reflecting enhanced capacity to deliver digestible, actionable guidance.

Figure 1.

Comparison between DeepSeek and ChatGPT: (a) character count (p< .01); (b) GQS score (p = .613, ns); (c) understandability (PEMAT-U) and actionability (PEMAT-A) scores. Statistical analysis was performed using the Wilcoxon test. ***p < .001; ns: non-significant, p > .05.

According to the PEMAT scale, Chatgpt and DeepSeek's answer preferences are shown in Table S3: DeepSeek showed superior integration of diagrams to make it easier for readers to follow the suggested actions (p < .01, OR = 27.0, 95% CI = 0.002–0.622) and help people understand the content more easily (p < .01， OR = 27.0, 95% CI = 1.61–453.0). By DeepSeek, the diagrams are simple with short and clear row and column headings (p < .01, OR = 27.0, 95% CI = 1.61–453.0). Notably, DeepSeek excelled in the item about “When recommending actions, the article further decomposes the actions into clear, actionable, concrete steps” (p < .01，OR = 49.0, 95% CI = 2.98–804.0) and providing calculation examples (p < .01， OR = 39.0, 95% CI = 2.36–647.0), crucial for hereditary risk communication. When it comes to explaining specialized terms, ChatGPT performs better than Deepseek by using more accessible vocabulary to help readers understand (p < .01，OR = 0.071, 95% CI = 0.009–0.543).

Despite these differences, both models exhibited parity in foundational communication elements. Neither demonstrated significant differences in Summary article (p = .549, OR = 0.57, 95% CI = 0.17–1.95) and numerical clarity (p = .125, OR = 9.0, 95% CI = 0.48–167.0). While neither model routinely presented planning tools (p = .062, OR = 11.0, 95% CI = 0.61–199.0) or personalized action plans for special populations (p = .031, OR = 13.0, 95% CI = 0.73–231.0).

The capability of DeepSeek to contextualize probabilistic information in structured visual representations such as pedigree charts and sequential risk mitigation tactics—crucial tools to carry out genetic counseling for gynecologic cancer—are prominent in these results. While the visual cues provided by DeepSeek and the concise baseline explanations from ChatGPT are impressive, highlighting the distinct strengths of each model, their performance and efficacy in a real-world clinical environment nevertheless require further empirical evaluation.

Discussion

This study builds upon the foundational work of Patel et al. by utilizing their set of 40 genetic counseling questions in gynecologic oncology to evaluate and compare responses generated by ChatGPT and DeepSeek. While previous research has explored the use of AI in general medical contexts, our work specifically examines the applicability and performance of large language models (LLMs) in hereditary cancer counseling—a domain with distinct clinical and communicative demands. By employing validated evaluation tools such as the GQS and the PEMAT, we provide a structured and reproducible framework for assessing the quality, clarity, and actionability of AI-generated genetic information. These 40 clinical scenarios and clinical patient education flags exhibit a significant performance improvement between DeepSeek and ChatGPT, thus having complementary advantages that can be used across the care continuum.

The deep learning approach to DeepSeek's broad data compilation and rich visual reasoning capacities outperforms the quality of all preoperative guidance. Visualizing structured family history graphs and the full risk lowering measures and succeeding in providing the requisite visual and spatial health literacy for hereditary cancer instruction. With their systematic use of headings, bullet points, and integrated diagrams, created a scaffolded learning experience. This structure controls the informational pacing, reducing cognitive load by breaking down complex topics into digestible segments.

The use of the ChatGPT as a linguistic ability for postoperative processes that require rapid interpretation of BRCA/HRD results will also be required. In contrast, ChatGPT produced fluid, consolidated prose, prioritizing communicative efficiency. The robust performance for maintaining logical consistency and numerical precision ensures the robustness of communication in time-sensitive meetings with clinicians. This conciseness projects a persona of authority and efficiency but may place a higher cognitive burden on the patient to distill key actions from dense paragraphs.

The proposed hybrid framework does not seek to resolve this paradox but rather to orchestrate it, strategically deploying each model's strength according to the communicative demands of the clinical scenario. Our proposed digital transformation will balance high diagnostic or counseling accuracy with operational efficiency in genetic counseling using three different processes simultaneously⁷: (1) use phase-based models that integrate DeepSeek's preoperative decision support system with the postoperative summarization provided by ChatGPT to simplify the clinical workflows for both the preoperative decision making and postoperative care; (2) using a three-dimensional approach, we can improve health literacy. DeepSeek's figures support comprehension, and ChatGPT's crisp summaries support recall; (3) human-AI teamwork redefines doctor-nurse functions by permitting frontline health care workers to formulate culturally specific care plans with AI-provided information, with the models handling the big-data-driven aspects, such as the prior probabilities.

This tripartite integration operationalizes patient-centered care principles in the digital age, which are aligned with the clinical and socially responsive interpretation, therefore strikes a balance between genetic precision and operational realities while conforming to AI moral principles for the hereditary cancer setting.⁸

Implications and future directions

These models inherently project different AI personas, which may influence patient perception and trust. DeepSeek's methodical and visual output embodies the persona of a dedicated educator, investing time to ensure comprehension. Conversely, ChatGPT's succinct style mirrors that of an efficient consultant, providing direct answers. The perceived trustworthiness and likability of these personas are critical yet unmeasured variables in our study. Future research should assess patient reception to these styles, as a model's clinical utility is contingent not only on its accuracy but also on its ability to build rapport and be perceived as credible and empathetic.

AI use involves restructuring clinical workflow, and adopting AI into the systems for a specific phase of care. The use of PEMAT standards for validation of AI-based health literacy should become a regulatory requirement to legitimize such tools. While our quantitative assessment of actionability and understandability using standardized scales is robust, it does not fully capture pragmatic competence—the ability to use language appropriately in social contexts. Genetic counseling is a dialogic process built on empathy, deference, and mutual understanding. A qualitative review of responses, as illustrated in the Supplementary Materials, reveals nuances in how the models handle critical pragmatic acts. For instance (Table S1 in the supplementary materials), in responding to the question “Who needs to get genetic testing?,” both models appropriately deferred final authority to healthcare professionals. However, their pragmatic strategies differed in clarity and force. ChatGPT provided a clear, directive statement: “Important: Genetic testing should ideally be done with guidance from a genetic counselor….” In contrast, DeepSeek employed a more neutral and somewhat vague formulation: “genetic testing is a personal choice.” While this respects patient autonomy, its lack of direct guidance in a high-stakes medical context may diminish its actionability. Therefore, we posit that the future development of AI counselors must include benchmarks for such pragmatic and relational competencies, ensuring they can not only inform but also communicate with appropriate nuance, clarity, and support. We encourage more research focus on establishing hybrid models, multicenter trials for demonstrating patients’ long-term results, and establish shared AI competencies certificate among different genetic counseling domains.

This study has not yet explored the performance of ChatGPT and DeepSeek in real-world clinical genetic counseling with patients. In fact the regulatory approval process for applying AI in real-world clinical genetic counseling still requires several key steps: (1) Validation and Performance Evaluation: Demonstrating robust performance through clinical validation studies that assess safety, efficacy, usability, and equity across diverse patient populations,⁹ (2) Explainability and Transparency: Ensuring that AI-generated recommendations are interpretable and that limitations are clearly communicated to clinicians and patients, (3) Integration and Workflow Compatibility: Proving that the tool complements—rather than disrupts—existing clinical workflows and that it provides actionable, not just informative, output, (4) Continuous Monitoring and Updates: Establishing mechanisms for post-deployment monitoring, error reporting, and iterative model updates in response to new evidence or guidelines (e.g. NCCN updates). Although there is no AI-specific regulatory pathway for genetic counseling yet, general software as a medical device (SaMD) principles—such as those in the Food and Drug Administration’s Digital Health Precertification Program—could be adapted. Close collaboration among developers, clinicians, regulators, and patients is essential throughout this process to establish standards that ensure reliability and ethical use in real-world practice.¹⁰

Limitations

We note that three limitations need to be addressed. First, the scoring system and evaluation framework in this study were based solely on assessments by board-certified gynecologic oncologists, who rated the responses generated by ChatGPT and DeepSeek. No direct input from patients or end-users was incorporated into the scoring process, nor was the model responses evaluated in real-world clinical consultations or by patients themselves. This represents a limitation of the current study, as it does not capture the perspectives, comprehension, or preferences of actual patients. Our simulation ran in line with NCCN guidelines, but in real life, there might exist unforseen factors in establishing rapport, a crucial aspect of patient-centered care, not covered (fully) by existing AI metrics. Comparison with the direct patient input, the assessment of cultural biases has been done according to NLP tools, what may cause underestimation of sociocultural complexity. Moreover, 20 out of the 40 gynecologic oncology genetic counseling questions used in this study were general genetic counseling questions, but the generalizability of the findings to other hereditary cancer disorders cannot yet be assured. So, to further establish generalizability to other hereditary cancer types and broader clinical genetics applications, it would be necessary to develop and validate a dedicated set of counseling questions specific to those contexts and reassess the performance and clinical utility of these AI models.

Conclusion

In the present study, through the clever use of DeepSeek's visual reasoning capabilities along with ChatGPT's final sentence summarization capability we were able to develop a more robust patient-focused genetic counseling assistant. Concurrent to sophisticated information and critical clinical necessity lies the high-risk domain of genetic cancer management where hybrid frameworks should balance the critical requirements for digital evolution arising from a need to attain both meticulous explanations and functional performance. AI-supported health care may not address all concerns but providing a proven strategy to broaden clinical workflows without sacrificing transparency is a vital step forward. Potential applications will soon drive AI model selection according to specific situations, and the search for cultural adaptation strategies to improve fairness—an essential next task for AI applications in low- and middle-income countries.

Supplemental Material

sj-docx-1-sci-10.1177_00368504251412703 - Supplemental material for AI-driven patient-centered care: A digital transformation framework for gynecologic cancer genetic counseling

Supplemental material, sj-docx-1-sci-10.1177_00368504251412703 for AI-driven patient-centered care: A digital transformation framework for gynecologic cancer genetic counseling by Ruiye Yang, Xiaoran Zheng, Yaoqi Deng, Mengqi Deng, Junyi Jiang and Jinwei Miao in Science Progress

Supplemental Material

sj-docx-2-sci-10.1177_00368504251412703 - Supplemental material for AI-driven patient-centered care: A digital transformation framework for gynecologic cancer genetic counseling

Supplemental material, sj-docx-2-sci-10.1177_00368504251412703 for AI-driven patient-centered care: A digital transformation framework for gynecologic cancer genetic counseling by Ruiye Yang, Xiaoran Zheng, Yaoqi Deng, Mengqi Deng, Junyi Jiang and Jinwei Miao in Science Progress

Supplemental Material

sj-docx-3-sci-10.1177_00368504251412703 - Supplemental material for AI-driven patient-centered care: A digital transformation framework for gynecologic cancer genetic counseling

Supplemental material, sj-docx-3-sci-10.1177_00368504251412703 for AI-driven patient-centered care: A digital transformation framework for gynecologic cancer genetic counseling by Ruiye Yang, Xiaoran Zheng, Yaoqi Deng, Mengqi Deng, Junyi Jiang and Jinwei Miao in Science Progress

Footnotes

Acknowledgments

This study was conducted in accordance with the ethical standards outlined in the Declaration of Helsinki, as revised in 2024. After review by the institutional ethics committee (Beijing Obstetrics and Gynecology Hospital Capital Medical University), it was determined that this research, which involved the analysis of publicly available AI model outputs and did not involve human participants, personal data, or clinical interventions, was exempt from further ethical approval requirements. The authors gratefully thank the editors and reviewers for their constructive suggestions to improve this manuscript.

ORCID iDs

Ruiye Yang

Mengqi Deng

Jinwei Miao

Author contributors

Ruiye Yang and Xiaoran Zheng performed the majority of experiments, including analyzed and interpreted the data and drafted the manuscript. Yaoqi Deng and Mengqi Deng verified the underlying data. Junyi Jiang and Jinwei Miao reviewed the article and assumed the responsibility of corresponding authors. All authors read and approved the final manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was funded by Capital Medical University Laboratory for Clinical Medicine and Gynecological Tumor Precise Diagnosis and Treatment Innovation Studio, Laboratory for Clinical Medicine, Capital Medical Universit and Beijing Natural Science Foundation (Grant number: L2510025).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

All data generated and/or analyzed during this study are included in this published article.

Supplemental material

Supplemental material for this article is available online.

References

Vyas

Kumar

Sharma

, et al. Advancing the frontier of artificial intelligence on emerging technologies to redefine cancer diagnosis and care. Comput Biol Med 2025; 191: 110178.

Cook

Baty

Dent

, et al. Defining orienting language in the genetic counseling process. J Genet Couns 2023; 32: 685–697.

Patel

Hermann

Growdon

, et al. ChatGPT accurately performs genetic counseling for gynecologic cancers. Gynecol Oncol 2024; 183: 115–119.

Zhang

Yang

Shen

, et al. Quality of online video resources concerning patient education for neck pain: a YouTube-based quality-control study. Front Public Health 2022; 10: 972348.

Shan

Dong

, et al. The Chinese version of the patient education materials assessment tool for PRIN materials: translation, adaptation, and validation study. J Med Internet Res 2023; 25: e39808.

Haider

, et al. Evaluating large language model (LLM) performance on established breast classification systems. Diagnostics (Basel) 2024; 14: 1491.

Yadav

Verma

Kumar

, et al. The perspectives of biomarker-based electrochemical immunosensors, artificial intelligence and the internet of medical things toward COVID-19 diagnosis and management. Mater Today Chem 2021; 20: 100443.

Al Knawy

McKillop

Abduljawad

, et al. Successfully implementing digital health to ensure future global health security during pandemics: a consensus statement. JAMA Netw Open 2022; 5: e220214.

Chauhan

Yadav

Bhatia

, et al. Global impact of vitamin D deficiency and innovative biosensing technologies. Chem Eng J 2025; 506: 159790.

10.

Tun

SYY

Madanian

Mirza

. Internet of things (IoT) applications for elderly care: a reflective review. Aging Clin Exp Res 2021; 33: 855–867.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.07 MB

0.01 MB

0.02 MB