Abstract
Objective
This feasibility study aimed to assess whether minimally customized large language models (LLMs) can achieve knowledge levels relevant for self-medication support by evaluating their performance on the Japanese Registered Salesperson Examination. Registered salespersons provide consumers with guidance on the appropriate selection and safe use of over-the-counter drugs in Japan. We explored whether LLMs customized through a simple upload of reference materials, without requiring programming skills or advanced prompt engineering, could achieve comparable knowledge levels.
Methods
We used the 2024 Registered Salesperson Examination in Japan as a benchmark dataset, comprising 838 text-only multiple-choice questions across five domains. We created customized GPTs by uploading the official guide for examination question development as a PDF file to GPT-4o, without additional prompts or instructions. Both GPT-4o and the customized GPTs were evaluated using a zero-shot approach with web-browsing capabilities disabled.
Results
The overall accuracy was 78.40% for GPT-4o and 92.36% for the GPTs, with this difference being statistically significant.
Conclusion
This feasibility study demonstrates that minimal customization through the upload of reference materials can achieve performance improvements on knowledge-based assessments. However, examination performance represents only an indirect indicator of knowledge for self-medication support. Future research should evaluate LLM performance in real-world case scenarios and assess practical utility and safety before implementation in consumer self-medication settings.
Introduction
Large language models (LLMs) became widely recognized following the release of ChatGPT in November 2022. 1 LLMs are rapidly influencing research, education, and professional practice in the healthcare field. For example, studies have been conducted to assess the performance of LLMs on national licensing examination questions in fields such as medicine,2,3 dentistry,4,5 and pharmacy.6,7 However, these studies have reported that LLM performance varies across domains, with some showing low accuracy rates. These findings suggest limitations of general LLMs in certain domains.2–7 To address these limitations, customizing LLMs for specific tasks or domains has gained attention.
In 2023, OpenAI introduced GPTs, which are custom versions of ChatGPT that can be created for specific purposes. 8 The introduction of GPTs has made lightweight customization more accessible, and users can create domain-tailored versions with embedded knowledge and instructions without requiring advanced techniques such as retrieval-augmented generation (RAG) or fine-tuning. Recent studies have demonstrated that customized LLMs can be applied in various healthcare domains, such as management of pancreatic cystic lesions, 9 specialized education in anatomical sciences, 10 and radiographic diagnosis in dentistry. 11 Although these approaches typically require specialized knowledge for selecting reference materials and designing prompts, they support the promise of customized LLMs in specialized domains, including clinical decision-making, professional training, and diagnostic support.
Beyond these specialized professional applications, customized LLMs hold potential as a digital health tool for consumer-oriented health support, for example in self-medication. Such applications could contribute to expanding the digital health community by broadening the scope of AI-assisted health support from professional clinical settings to consumer self-care.
Self-medication is becoming increasingly important around the world. It allows individuals to take responsibility for their own health and treat minor illnesses on their own. In Japan, the government has promoted self-care as a national strategy to control healthcare costs and extend healthy life expectancy through preventive activities and self-care initiatives. 12 As part of the policy framework, the registered salesperson system was established in 2008 to enhance accessibility to over-the-counter (OTC) medications. 13 Registered salespersons are certified professionals who are distinct from pharmacists. Even in the absence of a pharmacist, they are authorized to sell OTC drugs with relatively low risks of adverse effects and to provide usage guidance to consumers.
The guidance provided by registered salespersons may be an important source of knowledge for the general public in practicing self-medication. Therefore, LLMs with the knowledge of a registered salesperson could support the public in self-medication by helping users select appropriate OTC drugs, use them safely, and recognize when to consult healthcare professionals. To date, however, no attempts have been made to construct LLMs with such knowledge through minimal, non-technical methods that general users can perform. Given the potential benefits for public health, we considered it worthwhile to explore this possibility.
The aim of this feasibility study was to assess whether minimally customized LLMs can achieve knowledge levels relevant for self-medication support, for their future application in real-world self-medication scenarios. As noted above, general LLMs have shown variable performance across domains, with potential limitations in certain specialized areas. Customization of LLMs may help address these limitations. Healthcare professionals or subject matter experts with limited programming skills could potentially create customized LLMs by selecting and uploading appropriate reference materials. This study focuses on a minimal customization approach using only uploaded reference materials without additional prompts. To objectively evaluate knowledge required for OTC drug guidance, we compared the performance of GPTs customized with official examination guidelines and non-customized GPT-4o using the Japanese Registered Salesperson Examination.
Materials and methods
Dataset
This study used the 2024 Japanese Registered Salesperson Examination (RSE-2024) as a benchmark dataset for evaluating LLM performance.14–20 The RSE-2024 is a nationally standardized qualification examination that assesses the knowledge necessary for registered salespersons to sell OTC drugs and provide usage guidance to consumers. The examination questions are publicly available on the official websites of prefectural governments in Japan. No examination content was reproduced in this paper; the questions were used solely for performance evaluation, and only numerical results are reported, so no specific permission was required for this use. The examination consists of five domains: (1) common characteristics and basic knowledge of pharmaceuticals, (2) human body functions and pharmaceuticals, (3) major pharmaceuticals and their actions, (4) pharmaceutical-affairs related laws and regulatory systems, and (5) proper use and safety measures for pharmaceuticals. 21 Each examination set comprises 120 text-only multiple-choice questions: 40 from major pharmaceuticals and their actions, and 20 from each of the other four domains. The 2024 examination was administered using seven distinct examination sets corresponding to seven regional blocks across Japan. We collected all seven sets from official public sources, capturing all examination questions administered throughout Japan.14–20 Two questions were officially withdrawn from scoring by the examination authority because the correct answers could not be determined from the official examination guidelines. 20 These questions were excluded from our analysis, resulting in a final dataset of 838 questions.
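The dataset composition described above can be tallied as a simple sanity check (a minimal sketch; the shortened domain names are illustrative, and the two withdrawn questions are assumed to be subtracted from the combined pool):

```python
# Tally of the RSE-2024 benchmark dataset described in the text:
# 7 regional examination sets, each with 120 text-only multiple-choice
# questions (40 in "major pharmaceuticals and their actions", 20 in each
# of the other four domains), minus 2 officially withdrawn questions.

QUESTIONS_PER_DOMAIN = {
    "common characteristics and basic knowledge": 20,
    "human body functions and pharmaceuticals": 20,
    "major pharmaceuticals and their actions": 40,
    "pharmaceutical-affairs related laws": 20,
    "proper use and safety measures": 20,
}
NUM_SETS = 7
WITHDRAWN = 2

per_set = sum(QUESTIONS_PER_DOMAIN.values())  # 120 questions per set
total = NUM_SETS * per_set - WITHDRAWN        # 838 questions analyzed

print(per_set, total)  # 120 838
```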
Customized GPTs and evaluation procedure
We evaluated performance on the RSE-2024 using GPT-4o (OpenAI Global, San Francisco, CA, USA; released on May 13, 2024) and the customized GPTs. GPT-4o was selected as the baseline comparator because it was freely accessible to general users at the time of our study, which aligns with our focus on consumer-oriented self-medication support. 21 Customizing GPTs requires no programming skills, making it an accessible and intuitive task. 8 We created the GPTs, a customized version of GPT-4o, by uploading the official guide for examination question development for the Registered Salesperson Examination as a PDF file, 22 without any additional prompts or instructions.
Both models were accessed through the web interface (https://chatgpt.com) between 15 July and 20 July 2025. Web-browsing capabilities were disabled for both models. A zero-shot approach 23 was employed with no prompt engineering or additional instructions to guide the models, except for the GPTs, for which only the official guidelines were uploaded. For each question, the original Japanese text and its corresponding answer options were directly entered in a new conversation to avoid contextual carryover between questions. After the model generated its response, we immediately recorded it and manually extracted the final answer label for comparison with the official answer key.
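The per-question procedure above (a fresh conversation per item, manual extraction of the final answer label, comparison against the official key) can be sketched as a scoring loop. This is an illustrative reconstruction only: the authors worked through the web interface by hand, and `ask_model`, the question records, and the regex-based label extraction are hypothetical stand-ins, not their actual tooling:

```python
import re
from collections import defaultdict

def extract_answer_label(response_text):
    """Pull the final numeric answer label (e.g. '3') out of a free-text
    model response; returns None if no digit 1-5 is present."""
    labels = re.findall(r"[1-5]", response_text)
    return labels[-1] if labels else None

def score_exam(questions, ask_model):
    """Score question dicts ('domain', 'text', 'answer') against the key.

    `ask_model` is called once per question, mimicking a new conversation
    with no contextual carryover between items.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        response = ask_model(q["text"])  # one fresh conversation per question
        predicted = extract_answer_label(response)
        total[q["domain"]] += 1
        correct[q["domain"]] += int(predicted == q["answer"])
    return {d: correct[d] / total[d] for d in total}

# Illustrative usage with a stub model that always answers "3":
questions = [
    {"domain": "laws", "text": "Q1 ...", "answer": "3"},
    {"domain": "laws", "text": "Q2 ...", "answer": "1"},
]
accuracy = score_exam(questions, ask_model=lambda text: "The answer is 3.")
print(accuracy)  # {'laws': 0.5}
```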
Statistical analysis
The number and proportion of correct answers were calculated for each model. Performance differences between GPT-4o and the GPTs were compared using McNemar's test for paired categorical data, performed with R software (version 4.4.2; R Foundation for Statistical Computing, Vienna, Austria). A p-value of less than 0.05 was considered statistically significant.
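McNemar's test compares paired binary outcomes using only the discordant pairs: questions one model answered correctly and the other did not. A minimal exact (binomial) version is sketched below in Python for illustration; the authors used R, and the discordant counts in the usage line are hypothetical, since the paper reports only aggregate accuracies:

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar test.

    b: pairs where model A was correct and model B was wrong.
    c: pairs where model B was correct and model A was wrong.
    Under H0, the discordant pairs follow Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    # Two-sided p-value: double the smaller tail probability, capped at 1.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)

# Hypothetical discordant counts, for illustration only:
print(round(mcnemar_exact(3, 9), 3))  # 0.146
```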
Results
Table 1 shows that the overall accuracy was 78.40% (95% CI: 75.46–81.14) for GPT-4o and 92.36% (95% CI: 90.35–94.07) for the GPTs, with this difference being statistically significant.
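The paper does not state which interval method produced these confidence bounds. As an assumption, a Wilson score interval for the GPT-4o accuracy (657/838, consistent with the reported 78.40%) yields bounds close to, though not exactly matching, the reported 75.46–81.14, so a different method such as Clopper-Pearson may have been used:

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# 657/838 correct answers corresponds to the reported 78.40% for GPT-4o.
lo, hi = wilson_ci(657, 838)
print(f"{lo:.4f}-{hi:.4f}")  # approximately 0.7549-0.8105
```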
Table 1. Comparison of GPT-4o and GPTs (customized version of GPT-4o): correct response rates by question category.
Discussion
As a feasibility study, this research explored the potential of LLMs to support self-medication by evaluating a version of ChatGPT customized with the official guidelines for the Japanese Registered Salesperson Examination. In terms of overall performance, both models exceeded the examination passing criterion of 70%, 24 with the GPTs achieving 92.36% and GPT-4o achieving 78.40%. In addition, the GPTs showed improvements across all domains, with the largest gains observed in domains where general LLM performance was relatively limited.
In the healthcare field, the performance of general-purpose LLMs is being actively evaluated.2–7 Recent studies have explored performance improvements through domain-specific customization. 25 Gorelik et al. reported that a customized GPT integrating multiple guidelines for the management of pancreatic cystic lesions produced correct recommendations in 87% of 60 clinical scenarios, achieving performance comparable to that of expert clinicians. 9 Kiyomiya et al. showed that a customized ChatGPT-4 enriched with package insert information outperformed non-customized models in medication counseling for OTC drugs. 26 These findings demonstrate that embedding domain-specific knowledge can enhance LLM performance, which supports the approach taken in our study.
Notably, while these studies combined explicit prompt instructions with uploaded knowledge,9,25,26 our study achieved substantial performance gains solely by uploading the official guidelines. The simplicity of this approach contrasts with methods requiring more advanced technical interventions, such as fine-tuning or RAG, and makes LLM customization accessible to non-technical users.
In our study, both models showed high accuracy in the human body functions and pharmaceuticals domain. The GPTs achieved 97.86%, the highest across all domains. This result suggests that knowledge in this domain may be broadly standardized in public sources, which would make it readily learnable by LLMs.
The superior accuracy of the GPTs suggests that direct reference to the official guidelines provides important benefits. The guidelines offer consistent terminology and definitions that may improve alignment with the knowledge required by the examination.
By contrast, the lowest accuracies were observed in the pharmaceutical-affairs related laws and regulatory systems domain. Similar trends have been reported for other Japanese healthcare examinations. Takeshita et al. found low performance in law-related content on an oral and maxillofacial radiology board examination. 27 Goto et al. reported a 25% accuracy rate for GPT-4o in questions related to relevant laws and regulations on a radiography certification examination. 28
These observations suggest that Japanese legal content represents a common challenge for general LLMs, possibly due to the relative scarcity of Japanese legal texts in training data and the complexity of legal terminology. However, even in this most challenging domain, the GPTs significantly improved accuracy from 62.14% with GPT-4o to 86.43%.
The major pharmaceuticals and their actions domain is considered directly related to knowledge essential for appropriate medication selection by the general public. In our study, the GPTs substantially improved accuracy in this domain from 77.42% to 92.83%.
This study has several limitations. First, this study did not include direct comparisons of examination performance with registered salespersons. While we used the passing criterion as a reference for baseline competency, we did not compare the models’ performance with actual examination scores of registered salespersons, which would provide a more direct assessment of relative knowledge levels. Second, this study evaluates examination performance as an indirect indicator of knowledge for self-medication support, rather than directly assessing the quality of recommendations or user safety in real-world settings. The examination scores primarily reflect foundational knowledge related to safety, including contraindications, dosage and administration, and regulatory requirements. However, they do not directly measure the ability to respond appropriately to individual user circumstances in actual self-medication situations, such as considering patient-specific factors, providing personalized advice, or recognizing complex clinical scenarios that require professional consultation. Third, this study used a single official guideline as the reference material for customization. While this guideline covers core knowledge required for the examination, it may not comprehensively address all OTC drug-related information, potentially resulting in knowledge gaps or biases in the customized GPTs. Fourth, this study has methodological limitations regarding the evaluation approach. Our evaluation was limited to GPT-4o, and performance characteristics may differ with newer models such as GPT-5.2 or other LLMs (e.g., Claude, Gemini, and Llama). In addition, each question was tested only once, whereas testing each question multiple times would help assess the consistency and reliability of model responses. Also, our study only evaluated text-based questions and responses. 
Current multimodal LLMs can recognize images,29–31 and it is possible for these models to provide appropriate guidance by analyzing images of OTC drug packages, potentially enhancing self-medication support. Additionally, all inputs were in Japanese. Song et al. have reported that answer accuracy differs depending on language even for the same questions. 32 Therefore, it is uncertain whether similar performance enhancements would generalize to other languages. Furthermore, creating customized GPTs requires a paid subscription plan, which could limit accessibility for some potential users and should be considered when evaluating practical implementation. Finally, our findings are based on the Japanese Registered Salesperson Examination and Japanese OTC regulations, which may limit generalizability to other countries where OTC drug classifications, regulatory frameworks, and healthcare systems differ significantly. In addition, as this is a feasibility study focused on evaluating minimal customization approaches, we did not address ethical, safety, and regulatory considerations that would be essential before implementing such systems in real-world consumer self-medication settings. Future research must carefully evaluate these concerns.
Conclusion
This study demonstrates the feasibility of using minimally customized LLMs for self-medication support. We found that basic customization involving only the upload of official guidelines for the registered salesperson examination led to accuracy improvements. These findings suggest that even users without programming skills or advanced prompt engineering expertise can construct high-accuracy, task-specialized LLMs by appropriately selecting and uploading reference materials.
However, examination performance represents only an indirect indicator of knowledge for self-medication support. Future research should evaluate LLM performance in real-world scenarios reflecting the complexity of individual patient circumstances, conduct user studies to assess usability and acceptability, and address ethical and safety concerns before implementation in consumer self-medication settings.
Footnotes
Ethical considerations
This article does not contain any data from humans.
Author contributions
Conceptualization: SO and YM. Data curation: SO. Formal analysis: SO and YM. Investigation: SO and YM. Methodology: SO, YM, ST and TM. Writing-original draft: SO. Writing-review and editing: YM, ST and TM. All authors have read and agreed to the published version of the manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Collaborative Research Fund of Sapporo City University.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
