Abstract
Objectives
Large language models (LLMs) are increasingly used in healthcare, with the potential for various applications. However, the performance of different LLMs on nursing license exams and their tendencies to make errors remain unclear. This study aimed to evaluate the accuracy of LLMs on basic nursing knowledge and identify trends in incorrect answers.
Methods
The dataset consisted of 692 questions from the Japanese national nursing examinations over the past 3 years (2021–2023) that were structured with 240 multiple-choice questions per year and a total score of 300 points. The LLMs tested were ChatGPT-3.5, ChatGPT-4, and Microsoft Copilot. Questions were manually entered into each LLM, and their answers were collected. Accuracy rates were calculated to assess whether the LLMs could pass the exam, and deductive content analysis and Chi-squared tests were conducted to identify the tendency of incorrect answers.
Results
Over the 3 years, the mean total scores and standard deviations (SDs) for ChatGPT-3.5, ChatGPT-4, and Microsoft Copilot were 180.3 ± 22.2, 251.0 ± 13.1, and 256.7 ± 14.0, respectively. ChatGPT-4 and Microsoft Copilot achieved accuracy rates sufficient to pass the examinations in all years. All LLMs made more mistakes in the health support and social security systems domain.
Conclusions
ChatGPT-4 and Microsoft Copilot may perform better than ChatGPT-3.5, and LLMs may incorrectly answer questions about laws and demographic data specific to a particular country.
Introduction
Large language models (LLMs) are increasingly integrated into various domains, including healthcare, demonstrating potential benefits and challenges.1–5 LLMs also offer numerous possibilities for future clinical applications.6,7
Some LLMs, such as ChatGPT, developed by OpenAI, have been evaluated on medical knowledge in previous studies. On the medical license examinations of various countries, ChatGPT-3.5 often fails to reach the passing criteria, whereas ChatGPT-4 has achieved a passing performance.8 Differences in performance between LLMs have also been reported: previous studies suggest that ChatGPT-4 and Microsoft Copilot perform well on medical knowledge,9 whereas Google Bard may perform slightly worse than ChatGPT-4.10 In addition, ChatGPT-4 performs well on the license examinations of other healthcare providers.11,12 In nursing, ChatGPT-4 has demonstrated sufficient performance to pass the national nursing examination, whereas ChatGPT-3.5 has not, consistent with findings for other healthcare license exams.13–15 The performance of LLMs in nursing has also been evaluated for several outcomes, including critical settings and nursing education.16 However, while these studies showed various possibilities for using LLMs in nursing, the differences in performance between ChatGPT and other LLMs on nursing license examinations remained unclear.
LLMs may thus perform well on basic knowledge in medical settings. However, LLMs are known to be imperfect and may sometimes answer incorrectly.17 In a previous study, LLMs did not pass license examinations with entirely correct answers, and their error types were examined in relation to medical knowledge.18 However, it remains unclear in which domains of nursing knowledge LLMs tend to answer incorrectly. Understanding these domains is important for the effective use of LLMs by healthcare providers, particularly nurses.
Therefore, this study aimed to evaluate the performance of LLMs in terms of the accuracy rate on basic nursing knowledge and to gain insight into their tendencies to answer incorrectly. The results of this study may guide nurses in the use of LLMs.
Materials and methods
Characteristics of the examination
The Japanese national nursing examination is organized by the Ministry of Health, Labour and Welfare (MHLW). The questions used in this study were drawn from the Japanese national nursing examinations for the 3 years from 2021 to 2023. All the data, including the examinations used in this study, are publicly available. Details of the data are provided in the supplementary file (S-text 1). Each examination consisted of 240 multiple-choice questions, in which examinees were asked to select one or two answers from a set of four or five options. Among these, 50 questions were designated as “compulsory questions” covering fundamental aspects of nursing knowledge. The remaining questions comprised 130 “general questions” and 60 “scenario-based questions.” Each correct answer to a compulsory or general question was worth one point, while each correct answer to a scenario-based question was worth two points. Therefore, the total score was 300 points if all questions were answered correctly. The passing criteria were as follows: (i) achieving an accuracy rate of at least 80% (40 points) on the compulsory questions, and (ii) achieving a combined score on the general and scenario-based questions at or above the borderline score set by the MHLW. In the years considered in this study, the borderline scores were 159, 167, and 152 points for 2021, 2022, and 2023, respectively. The MHLW has determined and published the examination criteria, which include 11 domains: structure and function of the human body; understanding disease and promoting recovery; health support and social security systems; basic nursing; adult nursing; geriatric nursing; pediatric nursing; maternal nursing; psychiatric nursing; home care nursing theory; and integrated and practical nursing. The domain to which each question belongs has not been made public.
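To make the scoring rules concrete, the pass/fail logic above can be expressed as a short function. The following is a minimal sketch in R (the software used for this study's statistical analyses); the function and variable names are our own illustration, not part of the examination's official materials:

```r
# Minimal sketch of the scoring and passing rules described above.
# Inputs: numbers of correctly answered questions per section and the
# year-specific borderline score (e.g. 152 points for 2023).
passes_exam <- function(compulsory_correct,  # out of 50, 1 point each
                        general_correct,     # out of 130, 1 point each
                        scenario_correct,    # out of 60, 2 points each
                        borderline) {
  compulsory_points <- compulsory_correct
  general_scenario_points <- general_correct + 2 * scenario_correct
  # Criterion (i): at least 80% (40 points) on the compulsory questions
  # Criterion (ii): combined general + scenario score at or above the borderline
  compulsory_points >= 40 && general_scenario_points >= borderline
}

# Example: a hypothetical examinee in 2023 (borderline = 152 points)
passes_exam(compulsory_correct = 45, general_correct = 100,
            scenario_correct = 40, borderline = 152)
#> [1] TRUE
```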
Analysis with LLMs
Three LLMs were used in this study: ChatGPT-3.5, ChatGPT-4 (OpenAI Incorporated, Mission District, San Francisco, USA), and Microsoft Copilot (Microsoft Corporation, WA, USA). We used the free versions of ChatGPT-3.5 and Microsoft Copilot and the subscription version of ChatGPT-4. Responses from ChatGPT-4 were collected between January 25 and April 5, 2024; responses from Microsoft Copilot between February 2 and March 5, 2024; and responses from ChatGPT-3.5 between July 5 and July 29, 2024. All responses were collected in the laboratory of each researcher's institution. The questions were manually entered into each LLM via the website (not the Application Programming Interface [API]). Entering the questions and collecting answers from the LLMs were divided between two researchers (TK and KH). Before entering the first question, the researchers provided the following prompt: “Please answer the questions to confirm your knowledge of nursing in Japan. This is a multiple-choice question, so please select the correct answer.” For subsequent questions, the researchers preceded each question with the prompt: “This is the next question.” If the LLM did not provide the designated number of responses, the researchers added a prompt instructing it to comply with the specified answer format: “Please select one or two answers.” This process was standardized between the two researchers, and each entered prompt and retrieved answer was checked carefully and separately by the two researchers to ensure correctness. Each year's examination was administered in a single chat session per LLM. The collected answers were scored using the official correct answers published by the MHLW. Questions containing images or tables were excluded from this study because the LLMs could not recognize them. In addition, questions publicly designated as inappropriate by the MHLW were excluded.
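Although all questions in this study were entered manually via each LLM's website, the same protocol could in principle be scripted through an API, as noted in the limitations. The following is a hypothetical sketch in R using the httr package against the OpenAI chat completions endpoint; it illustrates the prompt sequence only and is not how the data were collected, and the model name is illustrative:

```r
library(httr)

# Hypothetical sketch: this study entered questions manually via the web
# interface. This shows how the same prompt protocol could be scripted
# against the OpenAI chat completions API.
ask_llm <- function(messages, model = "gpt-4") {
  res <- POST(
    "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", Sys.getenv("OPENAI_API_KEY"))),
    body = list(model = model, messages = messages),
    encode = "json"
  )
  content(res)$choices[[1]]$message$content
}

# Initial instruction prompt used in the study, then one question per turn
history <- list(list(
  role = "user",
  content = paste("Please answer the questions to confirm your knowledge of",
                  "nursing in Japan. This is a multiple-choice question,",
                  "so please select the correct answer.")
))
# For each subsequent question, prepend "This is the next question." and
# append both the question and the model's reply to `history`, keeping the
# whole examination within a single chat context as in the study.
```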
Analysis of LLM results
A two-step approach was used to analyze the data. First, the answers were collected, and the accuracy rate on the compulsory questions was calculated. The total points were then calculated for each examination to evaluate whether each LLM passed.
Next, deductive content analysis19 was performed to identify error trends in the LLMs' answers. This approach helps identify critical concepts based on existing theories and previous research. This study used a categorization matrix with the 11 domains determined and published by the MHLW, and coding and categorization were then initiated. Before the deductive content analysis, incorrect answers were classified into groups according to which LLMs answered them incorrectly. The deductive content analysis was performed as follows: (i) incorrect answers from all examinations were compiled, and their content was used as the unit of analysis; (ii) two researchers (TK and KH) read the question text and the choices of all incorrect answers and coded them accordingly; and (iii) the codes were placed in the categorization matrix of 11 domains. Examples of incorrect answers output by the LLMs are provided in the supplementary file, along with potential reasons discussed among the researchers.
Statistical analysis
Descriptive analyses were performed in the first step of the LLM analysis to evaluate whether each LLM passed the examination. In addition, a one-way analysis of variance (ANOVA) across the 3 years of examinations was performed with LLM model as a factor to analyze differences in total scores among the LLMs. The ANOVA was followed by the Tukey–Kramer method for exploratory multiple comparisons among groups.
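For reference, this comparison can be expressed in R roughly as follows; this is a minimal sketch assuming a data frame `scores` with one row per LLM and examination year (the data-frame and column names are illustrative):

```r
# Minimal sketch, assuming a data frame `scores` with columns
# `model` (LLM name) and `total` (total score, 0-300), one row per year.
scores$model <- factor(scores$model)

fit <- aov(total ~ model, data = scores)  # one-way ANOVA, LLM model as factor
summary(fit)
TukeyHSD(fit)  # Tukey-Kramer exploratory multiple comparisons among models
```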
In the second step, to identify error trends in the LLMs' answers, the number of codes in each category from the deductive content analysis was counted, and a Chi-squared goodness-of-fit test and residual analysis were performed. The Chi-squared goodness-of-fit test examined whether errors were uniformly distributed across the domains; under this hypothesis of a uniform distribution, the expected value for each domain was calculated by dividing the total number of incorrect answers by the number of domains. Residual analysis was performed using standardized residuals, not adjusted standardized residuals, because the analysis was conducted on a single column of frequency data.
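In R, this second-step test can be sketched as follows; the per-domain counts below are illustrative placeholders anchored only to the totals reported in the Results, not the study's actual per-domain data:

```r
# Illustrative counts of incorrect answers across the 11 domains (35 in total);
# placeholders only, not the study's actual data.
errors_per_domain <- c(15, 5, 4, 2, 2, 2, 1, 1, 1, 1, 1)

# Expected counts under a uniform distribution across domains
expected <- rep(sum(errors_per_domain) / length(errors_per_domain),
                length(errors_per_domain))

# Chi-squared goodness-of-fit test (equal expected probabilities by default)
test <- chisq.test(errors_per_domain)
test

# Standardized residuals, (observed - expected) / sqrt(expected);
# these equal the Pearson residuals returned in test$residuals
(errors_per_domain - expected) / sqrt(expected)
```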
All statistical analyses were performed using R version 4.2.2 (R Foundation for Statistical Computing, Vienna, Austria). Statistical significance was defined as p < 0.05.
Ethical considerations
This study did not involve human or animal participants, and all the data used in this study are publicly available on the Internet. Therefore, ethical approval and patient consent were not required.
Results
Input data statistics
A total of 692 questions from the 3 years of Japanese national nursing examinations were included in the analysis. Of these, 234 questions were from 2021 (excluding six image or table questions), 228 from 2022 (excluding 10 image or table questions and two inappropriate questions), and 230 from 2023 (excluding nine image or table questions and one inappropriate question).
Accuracy rate
ChatGPT-3.5 did not pass in any year because its accuracy rate on the compulsory questions was below 80%. ChatGPT-4 and Microsoft Copilot passed the examinations in all years, with accuracy rates above 80% on the compulsory questions and total scores above the borderline. The mean total score ± standard deviation (SD) over the 3 years was 180.3 ± 22.2 for ChatGPT-3.5, 251.0 ± 13.1 for ChatGPT-4, and 256.7 ± 14.0 for Microsoft Copilot. The differences in total scores among the LLMs were not statistically significant by one-way ANOVA (effect size = 0.398, 95% confidence interval [CI] [0, 1]).
Table 1. Test scores and accuracy rates.
Abbreviation: SD, standard deviation.
Achieved the passing criteria.
Tendency to answer incorrectly
The number of questions answered incorrectly by all LLMs was 35, and the group answered incorrectly by ChatGPT-3.5 and ChatGPT-4 also included 35 questions. The results of the deductive content analysis and the number of questions categorized into each domain for these two groups are shown in Tables 2 and 3.
Table 2. Domains of incorrect answers of all LLMs.
Abbreviation: NA, not applicable.
Table 3. Domains of incorrect answers of ChatGPT-3.5 and ChatGPT-4.
Abbreviation: NA, not applicable.
In the group answered incorrectly by all LLMs, the health support and social security systems domain had the highest number of incorrect answers (15 questions), followed by basic nursing with five and geriatric nursing with four. In the group answered incorrectly by ChatGPT-3.5 and ChatGPT-4, the highest number of incorrect answers was in the adult nursing domain with 10, followed by health support and social security systems with nine and basic nursing with five. Details of the other incorrectly answered groups and examples of incorrect answers are provided in the supplementary file (S-tables 1 to 5). The Chi-squared goodness-of-fit test on the group of questions answered incorrectly by all LLMs showed a statistically significant difference.
Table 4. Tendency to answer incorrectly in each domain.
Abbreviation: LLM, large language model.
** Statistically significant differences based on residual analysis.
Discussion
The findings of this study highlight the performance of three LLMs on the Japanese national nursing examination and reveal common patterns in their incorrect answers. ChatGPT-4 and Microsoft Copilot achieved passing accuracy rates, whereas ChatGPT-3.5 did not. All LLMs tested in this study made more mistakes in the health support and social security systems domain; in particular, ChatGPT-3.5 and ChatGPT-4 answered incorrectly in the health support and social security systems and adult nursing domains.
The accuracy rate of ChatGPT-4 in this study was sufficient to pass the national license examinations, similar to previous studies. ChatGPT-4 performed well enough to pass the Japanese nursing license examination in every year examined, whereas ChatGPT-3.5 did not pass in any year. Our results are consistent with a previous study from China in which ChatGPT-4 outperformed ChatGPT-3.5 on the national nursing examinations of the United States and China.20 Several previous studies evaluating performance on the license examinations of other healthcare providers likewise showed that ChatGPT-4 performed well enough to pass, but ChatGPT-3.5 did not.11–13 GPT-4 was originally developed as a higher-performance model than ChatGPT-3.5 across various professional and academic benchmarks,17 and the results of this study demonstrate its actual performance. It was not possible to compare our results with the official accuracy rate and mean score of all examinees of the Japanese nursing license examination because the MHLW does not make them public.
In addition, the accuracy rate of Microsoft Copilot was similar to that of ChatGPT-4. This study revealed that both ChatGPT-4 and Microsoft Copilot performed well enough to pass the Japanese nursing license examination, and the differences in their average total scores over the 3 years were not statistically significant. A previous study similarly showed that Microsoft Copilot performed on par with ChatGPT-4,9 consistent with our results. Their performances are likely to be similar because Microsoft Copilot, like ChatGPT-4, was developed based on GPT-4 technology.1 Thus, it is reasonable that the performances of ChatGPT-4 and Microsoft Copilot were similar.
Finally, LLMs may not be all-around performers and may have domains in which they tend to answer incorrectly, which could stem from the training data of the GPT-4 technology. The incorrect answers in this study could be classified into groups according to which LLMs erred, and the mistakes were similar across LLMs. If the LLMs had no performance bias, errors would be evenly distributed across the 11 domains. However, the deductive content analysis indicated that while the LLMs generally performed well on basic nursing knowledge applicable worldwide, they frequently answered incorrectly on questions involving Japan-specific laws and demographic data. These differences were statistically significant based on the Chi-squared goodness-of-fit test and residual analysis. This pattern may be related to the fact that ChatGPT and Microsoft Copilot were developed using the same or similar technology, GPT-3.5 and GPT-4.1 GPT-4 technology can respond in several languages, including Japanese.17 However, GPT technology is not complete and exhibits some clearly documented error types21; for example, it can misunderstand tasks.18,21 Moreover, LLMs may have more difficulty with Japanese than with English, because Japanese language networks differ significantly from English ones owing to their different grammatical features.22 This suggests that the LLMs may not have correctly understood the examination written in Japanese, as used in this study. In addition, LLMs may have inherent biases in the domains in which they answered incorrectly in this study. The health support and social security systems domain requires knowledge of Japan-specific laws and demographic data. GPT-4 technology was trained on publicly available data, such as Internet data, and data licensed from third-party providers.17 However, a previous study suggested that GPT-4 technology may carry inherent biases reflected in its training data, the details of which have not been described.17,23 Consequently, the biased results in the incorrectly answered domains may reflect performance biases in the LLMs stemming from their training data. Specifically, these inaccuracies may be ascribed to the LLMs' limited exposure to detailed, country-specific data. Incorporating such localized data during the training phase could therefore enhance the accuracy and relevance of responses in region-specific contexts, such as country-specific laws and demographic data. Moreover, LLMs fine-tuned for the Japanese language have been developed in recent years,24 and the respective companies may gradually resolve these language-related problems in the future.
A key strength of this study is the validity afforded by the large sample of questions across multi-year examinations. However, this study has some limitations. First, the applicability of the results may be limited. The Japanese national nursing examination includes not only nursing knowledge applicable worldwide but also Japan-specific laws and demographic data. Furthermore, ChatGPT-4 was the latest model at the time of data analysis, but newer models, such as ChatGPT-4o, have since been released. It is therefore unclear, and may depend on the training data of the GPT technology, whether LLMs would show the same performance, especially the same tendency to answer incorrectly, on the examinations of other countries or with newer ChatGPT models. Second, this study could only evaluate performance on multiple-choice selection. The Japanese national nursing examination is structured with only multiple-choice questions, uses only the Japanese language, and is limited to Japanese clinical settings; image and table questions were excluded from this study. The performance of the LLMs on open-ended questions, in other languages or countries, and on image or table questions remains unclear. Third, biases related to the manual entry of questions into each LLM may have occurred, as manual entry could introduce unintentional errors or inconsistencies. However, biases that could have significantly affected our findings are assumed to have been minimized because each entered prompt and retrieved answer was carefully and separately checked by two researchers (TK and KH). In future studies, using APIs may provide a more robust methodology. Finally, there were differences from the actual examinees. The accuracy rate could not be compared with that of the actual examinees because it has not been published publicly, so the difference between our results and those of actual examinees remains unclear.
Nurses in clinical settings and nursing educators should recognize the potential reliability problems of LLMs, as incorrect answers due to biases in training data may be obtained, not only in Japan but also in other countries. LLMs are inherently imperfect, and their performance may be particularly limited in some domains, such as legal or demographic information specific to certain countries, where the high potential for errors could lead to incorrect conclusions. Additionally, when LLMs are used via the website, personal information must not be entered, as LLMs may learn from input data, posing a risk of data leakage.
However, these uncertainties can be managed by ensuring accountability through cross-checking LLM responses against human experts and official guidelines and by anonymizing input data to include only generalized information. These practices can help ensure safer and more reliable use of LLMs. This study did not evaluate LLM performance on advanced clinical nursing knowledge, open-ended questions, or questions involving images and tables. Recent studies suggest that current LLMs such as ChatGPT-4 omni (ChatGPT-4o) have the potential to evaluate images and tables with high performance,25 and LLMs fine-tuned with country-specific data are being developed.24 These recent LLMs may have higher performance potential than those used in this study. Thus, future research should examine whether combining recent LLMs such as ChatGPT-4o with LLMs fine-tuned on specific data yields better accuracy rates than reported here. The ability of LLMs to interpret such questions may not only improve daily nursing care and education but may also become essential for all healthcare providers in the future.
Conclusion
ChatGPT-4 and Microsoft Copilot may perform better than ChatGPT-3.5, and LLMs may incorrectly answer questions about laws and demographic data specific to a particular country. Future exploratory studies should be conducted to provide more evidence.
Footnotes
Author contributions
Tomoki Kuribara and Kenji Hirata conceptualized the study. Tomoki Kuribara developed the methodology and drafted the manuscript. Tomoki Kuribara and Kengo Hirayama performed the data analyses. Kenji Hirata supervised the study. All authors reviewed, edited, and approved the final manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: the Japan Society for the Promotion of Science (grant number JP24K13831).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
References
