Abstract
Purpose:
This study evaluates the performance of large language models (LLMs) in the context of the Chinese National Traditional Chinese Medicine Licensing Examination (TCMLE).
Materials and Methods:
We compared the performances of different versions of Generative Pre-trained Transformer (GPT) and Enhanced Representation through Knowledge Integration (ERNIE) using historical TCMLE questions.
Results:
ERNIE-4.0 outperformed all other models with an accuracy of 81.7%, followed by ERNIE-3.5 (75.2%), GPT-4o (74.8%), and GPT-4 Turbo (50.7%). On questions related to Western internal medicine, all models achieved accuracies above 86.7%.
Conclusion:
This study highlights the importance of cultural context in training data, which strongly influences the performance of LLMs on region-specific medical licensing examinations.
