Abstract
Objective
This feasibility study aimed to assess whether minimally customized large language models (LLMs) can achieve knowledge levels relevant for self-medication support by evaluating their performance on the Japanese Registered Salesperson Examination. Registered salespersons provide consumers with guidance on the appropriate selection and safe use of over-the-counter drugs in Japan. We explored whether LLMs customized through a simple upload of reference materials, without requiring programming skills or advanced prompt engineering, could achieve comparable knowledge levels.
Methods
We used the 2024 Registered Salesperson Examination in Japan as a benchmark dataset, comprising 838 text-only multiple-choice questions across five domains. We created customized GPTs by uploading the official guide for examination question development as a PDF file to GPT-4o, without additional prompts or instructions. Both GPT-4o and the customized GPTs were evaluated using a zero-shot approach with web-browsing capabilities disabled.
Results
The overall accuracy was 78.40% for GPT-4o and 92.36% for the GPTs, with this difference being statistically significant.
Conclusion
This feasibility study demonstrates that minimal customization through the upload of reference materials can achieve performance improvements on knowledge-based assessments. However, examination performance represents only an indirect indicator of knowledge for self-medication support. Future research should evaluate LLM performance in real-world case scenarios and assess practical utility and safety before implementation in consumer self-medication settings.
Introduction
Large language models (LLMs) became widely recognized following the release of ChatGPT in November 2022. 1 LLMs are rapidly influencing research, education, and professional practice in the healthcare field. For example, studies have been conducted to assess the performance of LLMs on national licensing examination questions in fields such as medicine,2,3 dentistry,4,5 and pharmacy.6,7 However, these studies have reported that LLM performance varies across domains, with some showing low accuracy rates. These findings suggest limitations of general LLMs in certain domains.2–7 To address these limitations, customizing LLMs for specific tasks or domains has gained attention.
In 2023, OpenAI introduced GPTs, which are custom versions of ChatGPT that can be created for specific purposes. 8 The introduction of GPTs has made lightweight customization more accessible, and users can create domain-tailored versions with embedded knowledge and instructions without requiring advanced techniques such as retrieval-augmented generation (RAG) or fine-tuning. Recent studies have demonstrated that customized LLMs can be applied in various healthcare domains, such as management of pancreatic cystic lesions, 9 specialized education in anatomical sciences, 10 and radiographic diagnosis in dentistry. 11 Although these approaches typically require specialized knowledge for selecting reference materials and designing prompts, they support the promise of customized LLMs in specialized domains, including clinical decision-making, professional training, and diagnostic support.
Beyond these specialized professional applications, customized LLMs hold potential as a digital health tool for consumer-oriented health support, for example in self-medication. Such applications could contribute to expanding the digital health community by broadening the scope of AI-assisted health support from professional clinical settings to consumer self-care.
Self-medication is becoming increasingly important around the world. It allows individuals to take responsibility for their own health and treat minor illnesses on their own. In Japan, the government has promoted self-care as a national strategy to control healthcare costs and extend healthy life expectancy through preventive activities and self-care initiatives. 12 As part of the policy framework, the registered salesperson system was established in 2008 to enhance accessibility to over-the-counter (OTC) medications. 13 Registered salespersons are certified professionals who are distinct from pharmacists. Even in the absence of a pharmacist, they are authorized to sell OTC drugs with relatively low risks of adverse effects and to provide usage guidance to consumers.
The guidance provided by registered salespersons may be an important source of knowledge for the general public in practicing self-medication. Therefore, LLMs with the knowledge of a registered salesperson could support the public in self-medication by helping users select appropriate OTC drugs, use them safely, and recognize when to consult healthcare professionals. To date, however, no attempts have been made to construct LLMs with such knowledge through minimal, non-technical methods that general users can perform. Given the potential benefits for public health, we considered it worthwhile to explore this possibility.
The aim of this feasibility study was to assess whether minimally customized LLMs can achieve knowledge levels relevant for self-medication support, for their future application in real-world self-medication scenarios. As noted above, general LLMs have shown variable performance across domains, with potential limitations in certain specialized areas. Customization of LLMs may help address these limitations. Healthcare professionals or subject matter experts with limited programming skills could potentially create customized LLMs by selecting and uploading appropriate reference materials. This study focuses on a minimal customization approach using only uploaded reference materials without additional prompts. To objectively evaluate knowledge required for OTC drug guidance, we compared the performance of GPTs customized with official examination guidelines and non-customized GPT-4o using the Japanese Registered Salesperson Examination.
Materials and methods
Dataset
This study used the 2024 Japanese Registered Salesperson Examination (RSE-2024) as a benchmark dataset for evaluating LLM performance.14–20 The RSE-2024 is a nationally standardized qualification examination that assesses the knowledge necessary for registered salespersons to sell OTC drugs and provide usage guidance to consumers. The examination questions are publicly available on the official websites of prefectural governments in Japan. No examination content was reproduced in this paper; the questions were used solely for performance evaluation, and only numerical results are reported, so no specific permission was required for this use. The examination consists of five domains: (1) common characteristics and basic knowledge of pharmaceuticals, (2) human body functions and pharmaceuticals, (3) major pharmaceuticals and their actions, (4) pharmaceutical-affairs related laws and regulatory systems, and (5) proper use and safety measures for pharmaceuticals. 21 Each examination set comprises 120 text-only multiple-choice questions: 40 from major pharmaceuticals and their actions, and 20 from each of the other four domains. The 2024 examination was administered using seven distinct examination sets corresponding to seven regional blocks across Japan. We collected all seven sets from official public sources, capturing all examination questions administered throughout Japan.14–20 Two questions were officially withdrawn from scoring by the examination authority because the correct answers could not be determined from the official examination guidelines. 20 These questions were excluded from our analysis, resulting in a final dataset of 838 questions.
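The dataset composition described above can be tallied as a simple sanity check (a minimal sketch; the shortened domain names are illustrative, and the two withdrawn questions are assumed to be subtracted from the combined pool):

```python
# Tally of the RSE-2024 benchmark dataset described in the text:
# 7 regional examination sets, each with 120 text-only multiple-choice
# questions (40 in "major pharmaceuticals and their actions", 20 in each
# of the other four domains), minus 2 officially withdrawn questions.

QUESTIONS_PER_DOMAIN = {
    "common characteristics and basic knowledge": 20,
    "human body functions and pharmaceuticals": 20,
    "major pharmaceuticals and their actions": 40,
    "pharmaceutical-affairs related laws": 20,
    "proper use and safety measures": 20,
}
NUM_SETS = 7
WITHDRAWN = 2

per_set = sum(QUESTIONS_PER_DOMAIN.values())  # 120 questions per set
total = NUM_SETS * per_set - WITHDRAWN        # 838 questions analyzed

print(per_set, total)  # 120 838
```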
Customized GPTs and evaluation procedure
We evaluated performance on the RSE-2024 using GPT-4o (OpenAI Global, San Francisco, CA, USA; released on May 13, 2024) and the customized GPTs. GPT-4o was selected as the baseline comparator because it was freely accessible to general users at the time of our study, which aligns with our focus on consumer-oriented self-medication support. 21 Customizing GPTs requires no programming skills, making it an accessible and intuitive task. 8 We created the GPTs, a customized version of GPT-4o, by uploading the official guide for examination question development for the Registered Salesperson Examination as a PDF file, 22 without any additional prompts or instructions.
Both models were accessed through the web interface (https://chatgpt.com) between 15 July and 20 July 2025. Web-browsing capabilities were disabled for both models. A zero-shot approach 23 was employed with no prompt engineering or additional instructions to guide the models, except for the GPTs, for which only the official guidelines were uploaded. For each question, the original Japanese text and its corresponding answer options were directly entered in a new conversation to avoid contextual carryover between questions. After the model generated its response, we immediately recorded it and manually extracted the final answer label for comparison with the official answer key.
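The per-question procedure above (a fresh conversation per item, manual extraction of the final answer label, comparison against the official key) can be sketched as a scoring loop. This is an illustrative reconstruction only: the authors worked through the web interface by hand, and `ask_model`, the question records, and the regex-based label extraction are hypothetical stand-ins, not their actual tooling:

```python
import re
from collections import defaultdict

def extract_answer_label(response_text):
    """Pull the final numeric answer label (e.g. '3') out of a free-text
    model response; returns None if no digit 1-5 is present."""
    labels = re.findall(r"[1-5]", response_text)
    return labels[-1] if labels else None

def score_exam(questions, ask_model):
    """Score question dicts ('domain', 'text', 'answer') against the key.

    `ask_model` is called once per question, mimicking a new conversation
    with no contextual carryover between items.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        response = ask_model(q["text"])  # one fresh conversation per question
        predicted = extract_answer_label(response)
        total[q["domain"]] += 1
        correct[q["domain"]] += int(predicted == q["answer"])
    return {d: correct[d] / total[d] for d in total}

# Illustrative usage with a stub model that always answers "3":
questions = [
    {"domain": "laws", "text": "Q1 ...", "answer": "3"},
    {"domain": "laws", "text": "Q2 ...", "answer": "1"},
]
accuracy = score_exam(questions, ask_model=lambda text: "The answer is 3.")
print(accuracy)  # {'laws': 0.5}
```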
Statistical analysis
The number and proportion of correct answers were calculated for each model. Performance differences between GPT-4o and the GPTs were compared using McNemar's test for paired categorical data, performed with R software (version 4.4.2; R Foundation for Statistical Computing, Vienna, Austria). A p-value of less than 0.05 was considered statistically significant.
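McNemar's test compares paired binary outcomes using only the discordant pairs: questions one model answered correctly and the other did not. A minimal exact (binomial) version is sketched below in Python for illustration; the authors used R, and the discordant counts in the usage line are hypothetical, since the paper reports only aggregate accuracies:

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar test.

    b: pairs where model A was correct and model B was wrong.
    c: pairs where model B was correct and model A was wrong.
    Under H0, the discordant pairs follow Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    # Two-sided p-value: double the smaller tail probability, capped at 1.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)

# Hypothetical discordant counts, for illustration only:
print(round(mcnemar_exact(3, 9), 3))  # 0.146
```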
Results
Table 1 shows that the overall accuracy was 78.40% (95% CI: 75.46–81.14) for GPT-4o and 92.36% (95% CI: 90.35–94.07) for the GPTs, with this difference being statistically significant.
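The paper does not state which interval method produced these confidence bounds. As an assumption, a Wilson score interval for the GPT-4o accuracy (657/838, consistent with the reported 78.40%) yields bounds close to, though not exactly matching, the reported 75.46–81.14, so a different method such as Clopper-Pearson may have been used:

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# 657/838 correct answers corresponds to the reported 78.40% for GPT-4o.
lo, hi = wilson_ci(657, 838)
print(f"{lo:.4f}-{hi:.4f}")  # approximately 0.7549-0.8105
```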
Table 1. Comparison of GPT-4o and GPTs (customized version of GPT-4o): correct response rates by question category.
Discussion
As a feasibility study, this research explored the potential of LLMs to support self-medication by evaluating a version of ChatGPT customized with the official guidelines for the Japanese Registered Salesperson Examination. In terms of overall performance, both models exceeded the examination passing criterion of 70%, 24 with the GPTs achieving 92.36% and GPT-4o achieving 78.40%. In addition, the GPTs showed improvements across all domains, with the largest gains observed in domains where general LLM performance was relatively limited.
In the healthcare field, the performance of general-purpose LLMs is being actively evaluated.2–7 Recent studies have explored performance improvements through domain-specific customization. 25 Gorelik et al. reported that a customized GPT integrating multiple guidelines for the management of pancreatic cystic lesions produced correct recommendations in 87% of 60 clinical scenarios, achieving performance comparable to that of expert clinicians. 9 Kiyomiya et al. showed that a customized ChatGPT-4 enriched with package insert information outperformed non-customized models in medication counseling for OTC drugs. 26 These findings demonstrate that embedding domain-specific knowledge can enhance LLM performance, which supports the approach taken in our study.
Notably, while these studies combined explicit prompt instructions with uploaded knowledge,9,25,26 our study achieved substantial performance gains solely by uploading the official guidelines. The simplicity of this approach contrasts with methods requiring more advanced technical interventions, such as fine-tuning or RAG, and makes LLM customization accessible to non-technical users.
In our study, both models showed high accuracy in the human body functions and pharmaceuticals domain. The GPTs achieved 97.86%, the highest across all domains. This result suggests that knowledge in this domain may be broadly standardized in public sources, which would make it readily learnable by LLMs.
The superior accuracy of the GPTs suggests that direct reference to the official guidelines provides important benefits. The guidelines offer consistent terminology and definitions that may improve alignment with the knowledge required by the examination.
By contrast, the lowest accuracies were observed in the pharmaceutical-affairs related laws and regulatory systems domain. Similar trends have been reported for other Japanese healthcare examinations. Takeshita et al. found low performance in law-related content on an oral and maxillofacial radiology board examination. 27 Goto et al. reported a 25% accuracy rate for GPT-4o in questions related to relevant laws and regulations on a radiography certification examination. 28
These observations suggest that Japanese legal content represents a common challenge for general LLMs, possibly due to the relative scarcity of Japanese legal texts in training data and the complexity of legal terminology. However, even in this most challenging domain, the GPTs significantly improved accuracy from 62.14% with GPT-4o to 86.43%.
The major pharmaceuticals and their actions domain is considered directly related to knowledge essential for appropriate medication selection by the general public. In our study, the GPTs substantially improved accuracy in this domain from 77.42% to 92.83%.
This study has several limitations. First, this study did not include direct comparisons of examination performance with registered salespersons. While we used the passing criterion as a reference for baseline competency, we did not compare the models’ performance with actual examination scores of registered salespersons, which would provide a more direct assessment of relative knowledge levels. Second, this study evaluates examination performance as an indirect indicator of knowledge for self-medication support, rather than directly assessing the quality of recommendations or user safety in real-world settings. The examination scores primarily reflect foundational knowledge related to safety, including contraindications, dosage and administration, and regulatory requirements. However, they do not directly measure the ability to respond appropriately to individual user circumstances in actual self-medication situations, such as considering patient-specific factors, providing personalized advice, or recognizing complex clinical scenarios that require professional consultation. Third, this study used a single official guideline as the reference material for customization. While this guideline covers core knowledge required for the examination, it may not comprehensively address all OTC drug-related information, potentially resulting in knowledge gaps or biases in the customized GPTs. Fourth, this study has methodological limitations regarding the evaluation approach. Our evaluation was limited to GPT-4o, and performance characteristics may differ with newer models such as GPT-5.2 or other LLMs (e.g., Claude, Gemini, and Llama). In addition, each question was tested only once, whereas testing each question multiple times would help assess the consistency and reliability of model responses. Also, our study only evaluated text-based questions and responses. 
Current multimodal LLMs can recognize images,29–31 and it is possible for these models to provide appropriate guidance by analyzing images of OTC drug packages, potentially enhancing self-medication support. Additionally, all inputs were in Japanese. Song et al. have reported that answer accuracy differs depending on language even for the same questions. 32 Therefore, it is uncertain whether similar performance enhancements would generalize to other languages. Furthermore, creating customized GPTs requires a paid subscription plan, which could limit accessibility for some potential users and should be considered when evaluating practical implementation. Finally, our findings are based on the Japanese Registered Salesperson Examination and Japanese OTC regulations, which may limit generalizability to other countries where OTC drug classifications, regulatory frameworks, and healthcare systems differ significantly. In addition, as this is a feasibility study focused on evaluating minimal customization approaches, we did not address ethical, safety, and regulatory considerations that would be essential before implementing such systems in real-world consumer self-medication settings. Future research must carefully evaluate these concerns.
Conclusion
This study demonstrates the feasibility of using minimally customized LLMs for self-medication support. We found that basic customization involving only the upload of official guidelines for the registered salesperson examination led to accuracy improvements. These findings suggest that even users without programming skills or advanced prompt engineering expertise can construct high-accuracy, task-specialized LLMs by appropriately selecting and uploading reference materials.
However, examination performance represents only an indirect indicator of knowledge for self-medication support. Future research should evaluate LLM performance in real-world scenarios reflecting the complexity of individual patient circumstances, conduct user studies to assess usability and acceptability, and address ethical and safety concerns before implementation in consumer self-medication settings.
Footnotes
Ethical considerations
This article does not contain any data from humans.
Author contributions
Conceptualization: SO and YM. Data curation: SO. Formal analysis: SO and YM. Investigation: SO and YM. Methodology: SO, YM, ST and TM. Writing-original draft: SO. Writing-review and editing: YM, ST and TM. All authors have read and agreed to the published version of the manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Collaborative Research Fund of Sapporo City University.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
