Abstract
Introduction
This study aims to critically assess the appropriateness and limitations of two prominent large language models (LLMs), Enhanced Representation through Knowledge Integration (ERNIE Bot) and Chat Generative Pre-trained Transformer (ChatGPT), in answering questions about liver cancer interventional radiology. Through a comparative analysis, the performance of these models was evaluated based on their responses to questions about transarterial chemoembolization and hepatic arterial infusion chemotherapy in both English and Chinese contexts.
Methods
A total of 38 questions were developed to cover a range of topics related to transarterial chemoembolization (TACE) and hepatic arterial infusion chemotherapy (HAIC), including foundational knowledge, patient education, and treatment and care. The responses generated by ERNIE Bot and ChatGPT were rigorously evaluated by 10 professionals in liver cancer interventional radiology. The final score was determined by one seasoned clinical expert. Each response was rated on a five-point Likert scale, facilitating a quantitative analysis of the accuracy and comprehensiveness of the information provided by each language model.
Results
ERNIE Bot outperformed ChatGPT in the Chinese context (ERNIE Bot: 5, 89.47%; 4, 10.53%; 3, 0%; 2, 0%; 1, 0% vs ChatGPT: 5, 57.89%; 4, 5.27%; 3, 34.21%; 2, 2.63%; 1, 0%; P = 0.001). However, ChatGPT outperformed ERNIE Bot in the English context (ERNIE Bot: 5, 73.68%; 4, 2.63%; 3, 13.16%; 2, 10.53%; 1, 0% vs ChatGPT: 5, 92.11%; 4, 2.63%; 3, 5.26%; 2, 0%; 1, 0%; P = 0.026).
Conclusions
This study preliminarily demonstrated that ERNIE Bot and ChatGPT effectively address questions related to liver cancer interventional radiology. However, their performance varied by language: ChatGPT excelled in English contexts, while ERNIE Bot performed better in Chinese. We found that choosing the appropriate LLM helps patients obtain more accurate treatment information. Both models require manual review to ensure accuracy and reliability in practical use.
Keywords
Introduction
Liver cancer is among the most common malignancies globally. According to the 2022 GLOBOCAN database, it accounts for over one million deaths worldwide, making it the third leading cause of cancer mortality after lung and colorectal cancers. Additionally, liver cancer is the sixth most frequently diagnosed cancer worldwide, representing a substantial disease burden and presenting significant challenges in treatment. 1 Among various treatment modalities for liver cancer—such as surgical resection, radiotherapy, and chemotherapy—interventional therapy holds a crucial role, offering unique advantages over other approaches. Transcatheter arterial chemoembolization (TACE) and hepatic artery infusion chemotherapy (HAIC) are two widely utilized interventional techniques that contribute significantly to the clinical management of liver cancer. These therapies directly target the tumor while minimizing damage to surrounding healthy tissue, potentially prolonging survival and improving patients’ quality of life. The unique benefits of interventional therapies underscore their importance within the comprehensive liver cancer treatment landscape.
However, due to the complexity and specialized nature of medical knowledge, it is often difficult for patients and their caregivers to accurately understand and grasp the issues related to HAIC and TACE. The accuracy and reliability of online content vary widely. Recently, with rapid advancements in artificial intelligence (AI), tools like ChatGPT, developed by OpenAI (San Francisco, USA), and ERNIE Bot (Enhanced Representation through Knowledge Integration), created by Baidu (Beijing, China), have offered new solutions to address these challenges. These models have powerful natural language processing capabilities and can generate coherent, logically consistent responses covering a wide range of knowledge domains. Studies have demonstrated ChatGPT's potential in medical contexts: in medical education, it has proven capable of answering exam-level questions, supporting learners as an interactive educational tool2,3; in clinical practice, ChatGPT has shown promise in enhancing the logic of clinical decision support systems, providing valuable recommendations that aid clinicians. 4 Additionally, recent studies have highlighted its potential to provide medical information tailored to specific patient groups. For instance, Gravina et al. 5 investigated its effectiveness in addressing questions related to inflammatory bowel disease (IBD), while Yeo et al. 6 explored its application in responding to queries about liver disease; in health management, ChatGPT serves as a virtual health coach for chronic disease management, offering guidance to patients to enhance health literacy and encourage positive behavior change. 7 However, most studies assessing ChatGPT's role in medicine have been conducted in English-speaking contexts, with limited exploration of its potential in other languages, such as Chinese, which is among the most widely spoken globally. Similar to ChatGPT, ERNIE Bot is a new-generation knowledge-enhanced large language model developed by Baidu in China. Nevertheless, few studies have reported its capabilities in medical consulting.
This study aims to evaluate the abilities of ChatGPT and ERNIE Bot to address questions related to TACE and HAIC interventions for liver cancer in both Chinese and English contexts, by examining the accuracy and comprehensiveness of each model in responding to specialized queries. Through this research, we hope to elucidate the potential value and limitations of these LLMs in the healthcare area and determine if they can be an effective tool for providing quality treatment information and education.
Materials and methods
Question design
The researchers initially developed a set of questions for TACE and HAIC by drawing from existing clinical treatment guidelines and addressing real-world challenges patients encounter in clinical practice.8,9 In a subsequent screening phase, the research team carefully removed redundant, ambiguous, or overly subjective questions, ultimately refining a final set of 38 relevant and practical questions (Figure 1). This final question set encompassed foundational knowledge, patient education, and treatment and care. This approach ensured a thorough assessment and enhanced the practical applicability of the findings from various perspectives.

TACE and HAIC question selection flowchart. Frequently asked questions about the knowledge and management of TACE and HAIC are derived from existing clinical treatment guidelines and patients’ questions in actual treatment.
Assessment process
We entered questions into the May 2024 versions of ChatGPT 4.0 and ERNIE Bot 4.0. Each question was entered in both English and Chinese, with each entry made independently using the “New Chat” function. Example prompts included: “Please answer the following questions about liver cancer.” To evaluate the accuracy and comprehensiveness of the models’ answers regarding liver cancer interventional radiology, an expert panel consisting of 10 professionals with extensive clinical experience in liver cancer interventions was assembled (years of interventional radiology experience: 5–10: 4/10, 40%; 10–20: 4/10, 40%; > 20: 2/10, 20%). Each panel member independently rated the responses from both models using a five-point Likert scale. Differences in ratings were addressed through moderated discussions among the panel members. The final score for each response was then determined by a seasoned clinical expert with over 20 years of interventional radiology experience, ensuring a consistent and rigorous evaluation process.
Evaluation of classification
Statistical analysis
Count data were summarized using frequencies and percentages, and statistical analyses were conducted using the Wilcoxon signed-rank test to compare the accuracy and comprehensiveness of responses between ChatGPT and ERNIE Bot. All statistical analyses were performed with SPSS version 25.0, with a two-sided P-value of less than .05 indicating statistical significance. Figures were generated using GraphPad Prism version 9.5. Responses were rated using the five-point Likert scale defined in Table 1. As this study did not involve patient data or sensitive personal information, ethics committee approval was not required.
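The paired comparison described above can be sketched in code. Below is a minimal pure-Python illustration of the Wilcoxon signed-rank test applied to paired Likert ratings; it uses the normal approximation to the null distribution (SPSS may apply an exact method for small samples), and the scores shown are invented for illustration only, not the study's data.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    x, y: paired ratings, e.g. five-point Likert scores assigned to the
    same questions answered by two models. Zero differences are
    discarded, as in the classic Wilcoxon procedure.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    # Rank the absolute differences, averaging ranks over ties.
    abs_sorted = sorted(abs(d) for d in diffs)
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and abs_sorted[j] == abs_sorted[i]:
            j += 1
        rank_of[abs_sorted[i]] = (i + 1 + j) / 2  # mean of ranks i+1..j
        i = j
    w_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    w_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)
    w = min(w_plus, w_minus)
    # Normal approximation to the null distribution of W.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w, p

# Hypothetical paired Likert ratings for ten questions (illustrative only).
ernie_bot = [5, 5, 4, 5, 5, 3, 5, 4, 5, 5]
chatgpt = [5, 3, 3, 5, 4, 3, 5, 3, 4, 5]
w, p = wilcoxon_signed_rank(ernie_bot, chatgpt)
print(f"W = {w}, two-sided P = {p:.3f}")
```

Because the test operates on within-question rank differences rather than raw means, it suits ordinal Likert data where each question yields one paired observation per model.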
Rating of ERNIE Bot and ChatGPT responses using a five-point Likert scale.
Results
A comparative analysis of ERNIE Bot and ChatGPT across Chinese and English contexts
Based on the data illustrated in Figure 2 and Table 2, both models demonstrated high performance across the 38 core questions related to TACE and HAIC in both Chinese and English contexts, with over half of the responses receiving the top score of 5. However, ERNIE Bot outperformed ChatGPT in the Chinese context regarding overall response quality (Table 2, P = 0.001). In contrast, in the English context, ChatGPT surpassed ERNIE Bot (Table 2, P = 0.026). Detailed information is listed in the Supplemental material.

A bar chart illustrating the ratings of responses provided by ERNIE Bot and ChatGPT, evaluated on a five-point Likert scale in both Chinese and English contexts. *: P < 0.05; ns: no statistical difference. The horizontal axis scale (1–100) represents the percentage distribution of scores across different levels on the five-point Likert scale.
An evaluation of the responses provided by ERNIE Bot and ChatGPT to questions about TACE and HAIC in both Chinese and English contexts.
Subgroup analysis in the Chinese context
In an in-depth comparative analysis within the Chinese context, we identified notable differences in the performance of ERNIE Bot and ChatGPT in medical information Q&A (Table 3 and Figure 3). Specifically, when we refined our evaluation into distinct dimensions—foundational knowledge, patient education, and treatment and care—and developed targeted questions for each category, ERNIE Bot demonstrated a clear advantage over ChatGPT. ERNIE Bot's responses were of significantly higher quality than those of ChatGPT (foundational knowledge, P = 0.023; treatment and care, P = 0.021, Table 3). In contrast, in the area of patient education, the differences were not significant.

A bar chart illustrating the ratings of responses related to foundational knowledge, patient education, and treatment and care-related questions, evaluated using a five-point Likert scale in the Chinese context. *: P < 0.05; ns: no statistical difference. The horizontal axis scale (1–100) represents the percentage distribution of scores across different levels on the five-point Likert scale.
The ratings of responses provided by ERNIE Bot and ChatGPT concerning foundational knowledge, patient education, and treatment and care-related questions about TACE and HAIC in the Chinese context.
Subgroup analyses in the English context
As illustrated in Figure 4 and Table 4, in contrast to the Chinese context, the overall performance of ChatGPT was superior to that of ERNIE Bot in the English context (foundational knowledge, P = 0.036; treatment and care, P = 0.015, Table 4). In the category of patient education, both ChatGPT and ERNIE Bot performed well, with no significant difference between them.

A bar chart illustrating the ratings of responses related to foundational knowledge, patient education, and treatment and care-related questions, evaluated using a five-point Likert scale in the English context. *: P < 0.05; ns: no statistical difference. The horizontal axis scale (1–100) represents the percentage distribution of scores across different levels on the five-point Likert scale.
The ratings of responses provided by ERNIE Bot and ChatGPT concerning foundational knowledge, patient education, and treatment and care-related questions about TACE and HAIC in the English context.
LLMs’ performance varied by language in some questions
The results revealed a clear divergence in the performance of ERNIE Bot and ChatGPT across language contexts. For example, in the Chinese context, ERNIE Bot's answer to the question “What are the advantages and disadvantages of HAIC compared to TACE?” was rated 5, whereas the statement “Most patients can complete TACE under outpatient conditions” in ChatGPT's answer was controversial (see Supplemental material), and that response was rated 2. Likewise, ERNIE Bot's answer to “How soon can patients eat after TACE and what are the dietary requirements?” was correct and complete, while ChatGPT's answer of “immediately after surgery and several hours after surgery” was unclear, given that patients are generally recommended to fast for 4–6 h after TACE; this response was rated 4.
In the English context, ChatGPT's answer to the question “What are the common chemotherapy regimens for HAIC?” was correct and complete and was rated 5. In contrast, the following passage generated by ERNIE Bot was disputed: “It's important to note that HAIC is typically reserved for patients with liver-dominant metastatic disease who have failed or are not candidates for systemic chemotherapy or other local therapies. HAIC is generally considered a more aggressive approach and is associated with higher risks of complications compared to systemic chemotherapy. Therefore, it is typically only offered at specialized centers with experience in this type of treatment.” Because HAIC is an invasive treatment (requiring a catheter to be implanted into the hepatic artery), it mainly carries catheter-related complications and local liver risks, such as catheter-related infections and worsening liver function, whereas the side effects of systemic chemotherapy are more likely to be systemic toxicities, such as bone marrow suppression, gastrointestinal reactions, declining systemic immunity, and organ toxicity. The passage may therefore mislead patients into believing that HAIC leads to more serious systemic problems, and it was rated 2. Similarly, ChatGPT's answer to “What are the adverse effects and complications after TACE?” was correct and complete, while ERNIE Bot's answer omitted liver abscess and upper gastrointestinal bleeding from the postoperative complications. Moreover, point 6 of its answer, “death,” is disputed as an outcome rather than a complication, so the response was rated 2.
Discussion
LLMs are generating considerable interest in the medical field. In medical diagnosis, ChatGPT has demonstrated accuracy and effectiveness comparable to that of professional rheumatologists when diagnosing inflammatory rheumatic diseases. Although it may occasionally fall short of clinical practitioners in specific details, ChatGPT can still provide correct differential diagnoses across a wide range of cases. 10 ChatGPT has also shown promise in patient education; for example, Campbell et al. 11 found that ChatGPT could appropriately answer most questions about thyroid nodules, regardless of prompt type. Additionally, ChatGPT achieved a 95% response accuracy rate when serving as an educational resource for patients with multiple myeloma, underscoring its potential as a reliable tool in patient education. 12 Despite these advances, there remains limited research evaluating the quality of LLMs’ responses to questions about TACE and HAIC.
This study provides an in-depth analysis of the performance of ERNIE Bot and ChatGPT in answering questions related to TACE and HAIC in both English and Chinese contexts, highlighting the potential of AI in interventional medicine. Notably, this research introduces ERNIE Bot, an AI language model, into the medical field for the first time and directly compares it with ChatGPT. In the Chinese context, ERNIE Bot demonstrates superior language comprehension and processing capabilities, delivering more accurate and comprehensive responses than ChatGPT. This advantage can largely be attributed to its training on Chinese-specific datasets and real-time updated databases, enabling it to provide more precise information. Such capabilities make ERNIE Bot a valuable tool for Chinese-speaking healthcare environments, where accurate comprehension of complex medical terminology is critical. Its excellent performance not only reflects the benefits of language-specific model training but also suggests that large language models aligned with the local language and culture may better meet users' needs. In contrast, ChatGPT demonstrated robust language generation capabilities in English contexts, consistent with its reputation as a versatile and widely applicable AI model. This performance highlights its potential as a tool for global healthcare communication and patient education in multilingual settings. However, ChatGPT's relatively weaker performance in Chinese underscores the challenges faced by general-purpose models when applied to linguistically and culturally diverse healthcare environments. Consistent with existing studies,13–19 our data also highlight the strong capabilities of ERNIE Bot and ChatGPT for patient education. These findings underscore the significant potential and value of AI in medical education.
Accessible and accurate medical information is crucial for patients, as it helps them better understand their conditions and treatment options and increases their confidence in shared decision-making with healthcare providers, ultimately improving treatment outcomes. Research shows 20 that inadequate disease knowledge is closely associated with increased healthcare service utilization and costs. The integration of advanced LLMs to deliver personalized and precise educational support represents an innovative and efficient strategy for optimizing medical resources and enhancing the accessibility and convenience of healthcare services. Furthermore, this study confirms that selecting the appropriate large language model for a given language context enables patients to obtain medical information more accurately.
A detailed analysis of specific questions revealed that, despite the overall high performance of both large language models, some omissions and shortcomings remain in their responses. For instance, in the Chinese context, in response to “What are the adverse reactions and complications after TACE?” ERNIE Bot overlooked allergic reactions, whereas ChatGPT did not address these reactions or potential complications like ectopic embolization.
Additionally, in the English context, in answering “What are the indications and contraindications for HAIC treatment?” ChatGPT's list of contraindications omitted coagulation disorders and renal insufficiency. When responding to “What is HAIC (hepatic arterial infusion chemotherapy)?” ERNIE Bot did not mention the radial artery approach. In addressing “What are the advantages and disadvantages of HAIC compared to TACE?” ERNIE Bot stated that “Since HAIC does not use embolic agents, its effects may not be as durable as TACE” and that “HAIC is more invasive compared to TACE” which are contentious assertions. These omissions and inaccuracies can be attributed to several factors. First, ChatGPT's knowledge database was last updated in April 2023, potentially limiting its ability to provide the most current medical information. Additionally, both models may face limitations in training data, algorithms, and computational resources, leading to incomplete or inaccurate responses on specific medical topics. Lastly, the inherently complex and nuanced nature of medical practice demands comprehensive consideration of multiple factors, which general language models may not fully capture, resulting in gaps in their answers. Despite AI's significant potential, these limitations highlight the need for further optimization. To ensure the comprehensiveness and accuracy of medical information, it is crucial to combine AI with human review in practical applications. This integrated approach can foster substantial advancements in medical knowledge and health consultation, ultimately providing patients with more precise and comprehensive support.
Our findings highlight that while both models show promise, their unique strengths highlight some important considerations for developing and applying AI tools in clinical practice. First, the observed differences between ERNIE Bot and ChatGPT highlight the necessity of localized training to improve AI performance in specific regions and languages. Second, while ERNIE Bot performs well in a single language environment, combining its language-specific expertise with ChatGPT's multilingual versatility can better meet different clinical needs. Finally, while the strengths of both models in language generation and information synthesis underscore their potential as complementary tools for clinical education and decision making, careful human supervision is still needed.
Limitations
This study has certain limitations. Notably, it has yet to evaluate the comprehensibility of ERNIE Bot and ChatGPT for patients without a medical background. Such patients often require clear, concise, and easily understandable explanations and medical advice, a factor not fully addressed in the current research. Future studies should thus more rigorously assess the user-friendliness and practical effectiveness of these tools for patient populations with limited medical knowledge. This will be essential for enhancing the accessibility and utility of health information and advice offered by AI-driven platforms.
Conclusion
In conclusion, this study shows that ERNIE Bot and ChatGPT perform well in addressing TACE- and HAIC-related questions, highlighting the potential of AI in healthcare. In the Chinese context, ERNIE Bot outperforms ChatGPT, providing timely and accurate information to Chinese-speaking users. In the English context, ChatGPT outperforms ERNIE Bot, providing accurate and fluent healthcare information for English-speaking users. Therefore, choosing a suitable large language model is important for patients seeking more accurate treatment information.
Supplemental Material
Supplemental material, sj-docx-1-dhj-10.1177_20552076251315511, for “Comparing the performance of ChatGPT and ERNIE Bot in answering questions regarding liver cancer interventional radiology in Chinese and English contexts: A comparative study” by Xue-ting Yuan, Chen-ye Shao, Zhen-zhen Zhang and Duo Qian in DIGITAL HEALTH.
Footnotes
Acknowledgments
The authors acknowledge all reviewers.
Authorship
All authors meet the ICMJE authorship criteria.
Contributorship
XTY, CYS, ZZZ, and DQ all contributed to planning, design, conception, data analysis, write-up, reference preparation, and manuscript writing.
Declaration of conflicting interest
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical approval
There are no human participants in this article and informed consent is not required.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was funded by Postgraduate Research & Practice Innovation Program of Jiangsu Province (SJCX24_1817).
Guarantor
All four authors take responsibility for the manuscript and for any liabilities regarding this work.
Supplemental material
Supplemental material for this article is available online.
References
