Abstract
Background
In Japan, consumers can purchase most over-the-counter (OTC) drugs without pharmacist guidance. Recently, generative artificial intelligence (AI) has become increasingly popular. Therefore, medical professionals need to consider the use of generative AI by consumers for medication counseling. We previously reported responses in Japanese from ChatGPT-3.5 to 264 questions regarding whether each of 22 OTC drugs can be taken under 12 typical patient conditions. The proportion of responses that satisfied the criteria of 1) accuracy, 2) relevance, and 3) reliability with respect to package insert instructions was 20.8%. In November 2023, OpenAI launched GPTs, enabling the construction of customized versions of ChatGPT using natural language. In the present study, we compared performance in providing medication guidance among a newly customized GPT, the latest non-customized version, ChatGPT-4o, and the previous version, ChatGPT-3.5. The aim was to determine whether customization and version updates of ChatGPT improved performance and to evaluate its potential usefulness.
Methods
We configured customized ChatGPT-4 by executing five instructions in Japanese and uploaded the text of package inserts for 22 OTC drugs as knowledge. We asked the same 264 questions as in our previous study.
Results
With the customized ChatGPT-4, the percentages of responses that satisfied the criteria of accuracy, relevance, and reliability were 93.2%, 100%, and 60.2%, respectively. Additionally, 56.1% of responses satisfied all three criteria, 2.7-fold higher compared with ChatGPT-3.5 and 1.3-fold higher compared with ChatGPT-4o.
Conclusion
The performance of our customized GPT far exceeded that of ChatGPT-3.5. In particular, the proportion of appropriate responses to the questions using brand names was significantly improved. ChatGPT can be customized by providing drug package insert information and using appropriate prompt engineering, potentially offering helpful tools in clinical pharmacy.
Introduction
With the popularization of the Internet, many people now gather information about health and medications through web searches. A survey of approximately 10,000 medication users over the age of 40 found that 47% had used a search engine such as Google to search for health-related information. 1 Of 400 adults surveyed in the United States, 75% had searched the Internet for health-related information. 2 A survey of 161 pharmacists found that 80% had received Internet-based medication-related inquiries from patients. 3
In Japan, consumers can purchase two classes of drugs at a drugstore or pharmacy without a prescription. One class, pharmaceuticals requiring guidance by a pharmacist, can be purchased only after a face-to-face consultation with a pharmacist. The other class, over-the-counter (OTC) drugs, is further classified into categories 1, 2, and 3. Category 1 can be sold via the Internet with consultation by a pharmacist, and categories 2 and 3 can be sold without a pharmacist's consultation. 4
Therefore, consumers may rely on the Internet, rather than a pharmacist, for advice and information on OTC drugs. Since the release of ChatGPT-3.5 in November 2022, generative artificial intelligence (AI) has become increasingly popular. Because generative AI excels at generating text and answering questions,5–7 consumers may turn to these AI tools for OTC drug consultations. We have previously analyzed the quality of responses by ChatGPT-3.5 to putative consultations for OTC drugs in Japanese. 8
We selected 22 popular OTC drugs and 12 common consumer conditions, combining them to create 264 putative questions, and analyzed ChatGPT-3.5's responses to these questions. The obtained responses were evaluated based on the following three criteria: 1) accuracy, 2) relevance, and 3) reliability of the instructed actions. However, only 20.8% of the 264 responses satisfied all three criteria. Huang et al. reported more favorable results when comparing the performance of ChatGPT-3.5 with that of clinical pharmacists, finding that ChatGPT-3.5 achieved response scores similar to those of clinical pharmacists for drug counseling. 9
In March 2023, ChatGPT-4, a newer version of ChatGPT with enhanced text-generation and question-answering capabilities, was released. The performance of ChatGPT-4 was reported to be sufficient to pass the national pharmacist examination in Japan, with an accurate response rate of 72.5%. 10 However, the use of generative AI in the medical field remains questionable due to concerns such as “hallucinations”, which occur when an AI generates inaccurate answers or references to non-existent resources and may potentially harm patients. More recently, in November 2023, OpenAI released GPTs, which allow users to create custom versions of ChatGPT, using natural language without the need for coding. A customized GPT has the capability to handle specific tasks based on the particular knowledge and instructions provided by users. OpenAI states on its official website that these customized GPTs can be used for various purposes, such as learning the rules of board games or assisting in teaching children's math (https://openai.com/index/introducing-gpts/). This rapid development of text-generating AI suggests that applications in clinical pharmacy will be possible in the near future. In the present study, we constructed a customized ChatGPT-4 (cGPT) with package insert information for OTC drugs. We employed the same putative questions and evaluation methods as in our previous study in order to compare the performance of cGPT with that of ChatGPT-3.5. We also compared it with ChatGPT-4o, the latest version. The purpose of this study was to assess whether cGPT can be utilized as an aid in clinical pharmacy, particularly in medication counseling and self-medication.
Methods
Selection of OTC drugs
We selected 22 common medications sold as OTC drugs in Japan, consistent with our previous study. 8 To create the questions, we used the generic name for 10 of the 22 drugs and the brand name for the remaining 12 drugs (Table 1). Eighteen of the 22 drugs belong to category 2. These drugs have a relatively higher risk, and safety information should be provided in order to prevent rare but possible severe adverse reactions that might require hospitalization. 11 Additionally, most of the active ingredients in OTC drugs sold in Japan are categorized as category 2. 12
List of 22 OTC drugs.
Category 1: Pharmacists are required to counsel the consumer due to high risk.
Category 2: Pharmacist counseling is not required and the medication can be sold by registered sales clerks as well as pharmacists; however, these drugs have relatively high risk (rare possible severe adverse reactions that may require hospitalization).
Category 3: Sales procedure is the same as category 2; however, these have relatively low risk.
Customization of GPTs
Creation of GPTs
GPT creation is a feature available exclusively to paid users; free users have access to publicly available GPTs, though with certain limitations. The standard ChatGPT webpage can be accessed at https://chatgpt.com/, whereas cGPT creation is conducted through a different interface, at https://chatgpt.com/gpts/editor. The first step in creating a cGPT was to determine its name and icon. This was followed by providing instructions on how the cGPT should respond to prompts. Next, knowledge was uploaded, in PDF format in our case. Finally, we enabled the cGPT to access the web.
Guidelines for configuring the cGPT
The cGPT was configured in Japanese, using the following guidelines:
#1. The response to the question should be one of three options: “Contraindicated,” “Consult a medical professional,” or “Allowed.”
#2. Avoid ambiguous responses, such as adding “However, you should consult a specialist or doctor.”
#3. Responses should be provided step-by-step.
#4. If appropriate knowledge is not found in the uploaded file, the response should be generated using ChatGPT-4 while adhering to guidelines #1, #2, and #3.
#5. All responses should be in Japanese.
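As a rough illustration, the five guidelines above can be assembled into a single instruction block of the kind pasted into the GPT editor. The wording below is a hypothetical English rendering for illustration only; the study's actual instructions were written in Japanese.

```python
# Hypothetical English rendering of the five configuration guidelines;
# the original instructions were given in Japanese in the GPT editor.
GUIDELINES = [
    'Answer with exactly one of: "Contraindicated", '
    '"Consult a medical professional", or "Allowed".',
    'Avoid ambiguous hedges such as appending '
    '"However, you should consult a specialist or doctor".',
    "Explain your reasoning step by step.",
    "If the uploaded package inserts lack the answer, fall back to "
    "general model knowledge while still following rules 1-3.",
    "Respond in Japanese.",
]

def build_instruction_block(guidelines):
    """Number each guideline and join them into one instruction string."""
    return "\n".join(f"#{i}. {g}" for i, g in enumerate(guidelines, start=1))

print(build_instruction_block(GUIDELINES))
```

The numbering mirrors the #1–#5 convention used in the configuration so that later guidelines (e.g. #4) can refer back to earlier ones unambiguously.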
Uploading package insert information to cGPT
The package inserts for the 22 OTC drugs provided by the Pharmaceuticals and Medical Devices Agency (PMDA) as of March 2024 were used in HTML format. The HTML files of the package inserts in Japanese were imported into Microsoft Excel using the web query function, reformatted to the first normal form, and saved as comma-separated values (CSV) files. These CSV files were then merged to create a single PDF file, which was uploaded to the cGPT; this merging was necessary because GPTs allow at most 20 uploaded files. This operation was performed simultaneously with the customization process.
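The consolidation step (many per-insert CSV files combined so the upload stays within the 20-file limit) can be sketched as follows. The study performed the HTML extraction with Excel's web query and produced a PDF, so this stdlib sketch of the CSV merge is an illustrative stand-in with hypothetical file names, not the authors' actual tooling.

```python
import csv
import io

def merge_csv_texts(csv_texts):
    """Concatenate several CSV documents (one per package insert)
    into a single table, prefixing each row with its source name
    so that the origin of every row remains traceable after merging."""
    merged = []
    for name, text in csv_texts:
        for row in csv.reader(io.StringIO(text)):
            merged.append([name] + row)
    return merged

# Hypothetical two-drug example standing in for the 22 package inserts.
inserts = [
    ("drug_A.csv", "section,text\nDosage,One tablet\n"),
    ("drug_B.csv", "section,text\nWarnings,Do not drive\n"),
]
rows = merge_csv_texts(inserts)
```

Keeping the source-file column preserves the drug-to-row mapping that would otherwise be lost when 22 separate tables are flattened into one document.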
Other settings
Web browsing and code-interpreter functions were enabled.
Response evaluation
We selected 12 consumer characteristics, including consumer background (pregnancy, elderly, and driving), medical conditions (glaucoma, gastric ulcer, hypertension, hemodialysis, and past history of asthma), and concomitant medications (antihistamines, motion sickness drugs, cough medicines, and pain relievers), as in our previous study. We selected these three consumer backgrounds and five disease conditions because they are frequently mentioned or featured in alerts in the package inserts of OTC drugs in Japan. For concomitant medications, we chose those most commonly used as OTC drugs in Japan. A total of 264 questions (22 drugs × 12 conditions) were created.
The responses were evaluated based on three criteria: 1) accuracy, 2) relevance, and 3) reliability of the instructed actions.
The first two criteria were assessed with “yes” or “no” responses. Accuracy refers to the scientific correctness of the answer. Relevance refers to whether the answer logically addresses the question. Reliability is defined by whether the instructions in the answer are consistent with those in the package insert. The evaluation criteria for accuracy, relevance, and reliability are identical to those for correctness, coherence, and appropriateness, respectively, in our previous study. 8
The instructed actions of ChatGPT and the package inserts were categorized into three groups: “allowed” (no special precautions or consultation required), “requires consultation (with a medical professional),” and “contraindicated.” Responses were considered “reliable” if the instructed actions from cGPT matched the recommendations in the package insert; otherwise, they were deemed “unreliable.” For response evaluation, two pharmacists independently evaluated the responses from the cGPT and ChatGPT-4o, and agreement between their assessments was verified using the κ coefficient. In the case of discrepancies, consensus was reached through discussion and reconciliation.
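Inter-rater agreement of this kind is commonly quantified with Cohen's κ, which corrects observed agreement for the agreement expected by chance. A minimal stdlib sketch, using made-up example labels rather than the study's actual ratings:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater1)
    # Proportion of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from each rater's marginal label frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: two raters judging four responses as "yes"/"no".
print(cohens_kappa(["yes", "yes", "no", "no"],
                   ["yes", "no", "no", "no"]))  # 0.5
```

Values above 0.81 are conventionally interpreted as almost perfect agreement, which is the interpretation applied to the κ values of 0.95 and 0.91 reported in the Results.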
Evaluation metrics included the proportion of appropriate responses satisfying each criterion as well as the proportion of responses satisfying all three criteria. In addition, we compared the proportion of appropriate responses to questions using generic versus those using brand names. To assess reproducibility, questions that satisfied all three criteria were validated a second time on a separate day. The first trial was conducted from April 2 to April 9, 2024, for the cGPT and from November 15 to November 20, 2024, for ChatGPT-4o. The second trial was conducted from May 2 to May 21, 2024, for the cGPT and from November 25 to December 6, 2024, for ChatGPT-4o.
The results were compared with those from our previous study involving ChatGPT-3.5 8 and with the results from ChatGPT-4o.
Error analysis
Error analysis was conducted for questions where the responses from the cGPT did not align with the package insert instructions. The cGPT responses were configured to prioritize knowledge-based input. In cases where information could not be retrieved from the knowledge source, the response explicitly stated that the information was unavailable. This setup allows responses based on the uploaded knowledge to be differentiated from responses generated using web browsing. Based on the content of the responses, the error analysis categorized the issues as follows:
Step 0: Unknown reasons, because the response was not step by step.
Step 1: cGPT failed to load knowledge (response relied on web browsing).
Step 2: cGPT successfully loaded knowledge but analyzed it inaccurately.
Step 3: cGPT accurately analyzed knowledge but provided incorrect instructions.
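The four-step triage can be expressed as a small decision function. The boolean flags below are hypothetical stand-ins for the judgments the evaluators made when reading each failed response; the study's categorization was manual, not automated.

```python
def classify_error(stepwise, loaded_knowledge, analysis_correct):
    """Assign a failed response to one of the four error-analysis steps.
    The flags are hypothetical evaluator judgments, not automated checks."""
    if not stepwise:
        return "Step 0: unknown (response was not step by step)"
    if not loaded_knowledge:
        return "Step 1: failed to load knowledge (fell back to web browsing)"
    if not analysis_correct:
        return "Step 2: loaded knowledge but analyzed it inaccurately"
    return "Step 3: analyzed knowledge correctly but instructed incorrectly"
```

The ordering matters: each step is only reachable when all earlier steps succeeded, mirroring the sequential load-analyze-instruct pipeline described above.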
Statistical analysis
Statistical differences in the proportions of appropriate responses between ChatGPT-3.5 or ChatGPT-4o and the cGPT were assessed using Fisher's exact test with Bonferroni correction for multiple comparisons.
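As an illustration, a comparison of this kind can be reproduced with SciPy. The first count below (148/264 for the cGPT) is reported directly in the Results; the second (55/264 for ChatGPT-3.5) is back-calculated from the reported 20.8% and is therefore an approximation.

```python
from scipy.stats import fisher_exact

# 2x2 contingency table: responses satisfying all three criteria vs. not.
# 148/264 for the cGPT is reported directly; 55/264 for ChatGPT-3.5 is
# back-calculated from the reported 20.8% of 264 questions.
table = [[148, 264 - 148],   # cGPT
         [55, 264 - 55]]     # ChatGPT-3.5
odds_ratio, p_value = fisher_exact(table)
print(p_value)  # far below 0.05
```

With multiple pairwise comparisons (cGPT vs. ChatGPT-3.5 and cGPT vs. ChatGPT-4o across several criteria), the significance threshold is divided by the number of tests under the Bonferroni correction.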
Results
The κ coefficient was 0.95 for the cGPT and 0.91 for ChatGPT-4o, indicating almost perfect agreement between the two evaluators. Figure 1 shows the performance comparison among the cGPT, ChatGPT-3.5, and ChatGPT-4o. Out of 264 questions, the numbers of appropriate responses from the cGPT in terms of the accuracy, relevance, and reliability of the instructed actions were 246 (93.2%), 264 (100%), and 159 (60.2%), respectively. The performance of the cGPT showed significant improvements in each of the three criteria compared with ChatGPT-3.5. Additionally, the number of questions that satisfied all three criteria was 148 (56.1%), significantly higher (2.7-fold) than with ChatGPT-3.5.

Proportions of responses from ChatGPT-3.5, ChatGPT-4o, and cGPT that satisfied each and all criteria. n = 264. cGPT outperformed ChatGPT-3.5 in accuracy and relevance by 1.7-fold and 1.3-fold, respectively, and the number of questions that satisfied all three criteria was 2.7-fold higher with the cGPT. cGPT also exceeded ChatGPT-4o in accuracy by 1.3-fold, and the number of questions that satisfied all three criteria was 1.3-fold higher with the cGPT. Fisher's exact test (Bonferroni correction).
Figure 2 compares responses to questions using generic names (n = 120) with those using brand names (n = 144) for each criterion. Regardless of whether questions used generic or brand names, cGPT demonstrated improved performance across all criteria compared with ChatGPT-3.5. Specifically, although accuracy and relevance showed significant improvement, the improvement in reliability was not statistically significant.

Proportions of responses that satisfied each and all criteria when generic and brand names were used. Generic name (n = 120), brand name (n = 144). A: accuracy; B: relevance; C: reliability; D: satisfied all three criteria. Performance in terms of accuracy for questions using brand names was low for ChatGPT-3.5 and ChatGPT-4o but improved with the cGPT by 2.0-fold and 1.4-fold, respectively. Fisher's exact test (Bonferroni correction).
Table 2 compares the instructed actions among the cGPT, ChatGPT-4o, and the package inserts. Whereas our previous and present studies each included one problematic case in which ChatGPT-3.5 or ChatGPT-4o allowed the use of a drug under a condition contraindicated by the package insert, the cGPT exhibited no such cases. The instructions for action by the cGPT were consistent with the package inserts in 30.8% of the “allowed” cases, 89.7% of the “consultation required” cases, and 53.7% of the “contraindicated” cases. For both the cGPT and ChatGPT-4o, the distribution of discrepancies was significantly biased.
Comparison of instructed actions between cGPT, ChatGPT-4o and package inserts.
cGPT: customized GPT.
Figure 3 shows the results of the error analysis obtained step by step. Although the knowledge was correctly analyzed in 24.2% of cases, the final instructions did not align with the package insert. This occurred because, despite correctly analyzing the absence of precautions or contraindications in the package insert, the responses from the cGPT were overly cautious, often recommending consultation rather than allowing action based solely on the analysis. In 8.0% of cases, information not mentioned in the package insert was generated or the data were not analyzed accurately (Supplementary Table 1). Additionally, in 8.7% of cases, the system could not locate the relevant document within the knowledge base. The 3.0% categorized as “unknown” primarily consisted of cases where the system could not respond step by step, making error analysis infeasible.

Error analysis using cGPT's decision-making process. Step 0 indicates cases where cGPT could not respond step by step. In Step 1, cGPT loads knowledge. In Step 2, the knowledge is accurately analyzed. In step 3, cGPT provides instructions for action. Percentages represent the proportion of all 264 questions.
Figure 4 shows the result of the second survey to confirm the reproducibility of the cGPT. Of the 148 questions that satisfied all three criteria in the first survey using the cGPT, 107 (72.3%) also satisfied all three criteria in the second survey. This reproducibility of the cGPT was higher than that of ChatGPT-3.5, although the difference was not statistically significant. In contrast, the reproducibility showed a significant improvement by 1.3-fold for the cGPT compared with ChatGPT-4o.

Comparison of the reproducibility of ChatGPT-3.5, ChatGPT-4o, and cGPT. For the cases that satisfied all three criteria in the first survey, the proportion of cases that also satisfied all three criteria in the second survey is shown. The reproducibility of cGPT was higher than that of ChatGPT-3.5 and ChatGPT-4o by 1.2-fold and 1.3-fold, respectively. Fisher's exact test (Bonferroni correction).
Discussion
The proportion of appropriate responses from the cGPT was superior to that from ChatGPT-3.5. The cGPT also tended to provide more appropriate responses compared with ChatGPT-4o, demonstrating the potential application of generative AI in OTC drug counseling.
This improvement in performance may be attributed, at least in part, to the ability of GPTs to properly learn the information necessary for specific counseling scenarios. The improved performance of the cGPT may therefore be attributable both to improvements in the underlying ChatGPT model and to the uploaded knowledge.
Although the cGPT exhibited increased performance across all three criteria, the improvement in instructed action was relatively modest. However, improving the consistency of contraindications from 7.3% to 53.7% indicates the potential feasibility of utilizing cGPT in drug safety assessments. For 107 cases in which the package insert allowed use, cGPT advised consultation or specified contraindication in 74 (69.2%) cases. However, this does not necessarily imply that all the cGPT responses were inaccurate. For instance, Royal Jelly/Ginseng Fluid extract is a nutritional drink containing 50 mg of caffeine anhydrous per bottle, and its package insert does not list any drug interactions, including those with caffeine-containing drugs. Similarly, the package insert of caffeine tablets, an OTC drug used to suppress drowsiness, only advises against taking drugs within the same pharmacological category. Given this context, the cautionary response of the cGPT regarding the concomitant use of Royal Jelly/Ginseng Fluid extract or caffeine tablets with other caffeine-containing drugs is reasonable and should not be considered inaccurate. Thus, some instructions provided by cGPT may offer useful information that extends beyond what is stated in the package insert. These responses, especially when generic names are used in the questions, may have been influenced by information from overseas sources, as we did not restrict the question format to situations in Japan.
Responses generated by AI may include hallucinations, which can be particularly problematic when applied to healthcare practices. 13 Compared with ChatGPT-3.5, the cGPT in the present study exhibited a significantly reduced number of hallucinations, as assessed by the number of obvious scientific errors. More than 90% of the responses were accurate and relevant. Although the responsibility for this has traditionally fallen on the creator of each customized GPT, OpenAI published key guidelines in May 2024 aimed at enhancing the reliability and accuracy of GPT construction. 14 These guidelines comprise six key principles: 1. simplify complex instructions; 2. structure for clarity; 3. promote attention to detail; 4. avoid negative instructions; 5. granular steps; and 6. consistency and clarity. Checking the correspondence of our GPT customization procedure with these guidelines, we avoided negative expressions and gave simple instructions (#1 and #4). Additionally, we ensured attention to detail (#3) by requiring step-by-step responses. Our implementation of these principles likely contributed to the reduction of hallucinations. However, they were not entirely eliminated, likely due to failures to accurately analyze the uploaded package inserts. For example, cGPT inaccurately stated that “the package insert of kakkonto instructs avoiding driving a car after taking it,” when, in fact, no such instruction exists in the package insert. The instructions provided during the construction of GPTs are crucial in preventing hallucinations.
Generative AI responses may not always be reproducible. In the present study, the proportion of responses that satisfied all three criteria in the second survey did not show a significant improvement compared with ChatGPT-3.5. To enhance reproducibility, it may be necessary to provide examples of clear and consistent responses to questions during the construction of GPTs. Additionally, providing instructions to perform specific actions, such as using the code interpreter to analyze the files when answering questions, may help to increase reproducibility. In addition to reducing hallucinations, following the key guidelines described above might assist in constructing GPTs that produce more accurate and consistent responses with sufficient reproducibility.
To our knowledge, no other studies have evaluated the consultation performance of ChatGPT customized with information from drug package inserts. In the field of ophthalmology, Sevgi et al. recently suggested that when customized with guidelines for diabetic retinopathy and angle closure glaucoma, ChatGPT might be useful for providing medical education and supporting clinical decision-making. 15 Similarly, Gorelik et al. customized ChatGPT with guidelines for pancreatic cysts and reported that 87% of its responses to clinical scenarios were in agreement with gastroenterologists’ recommendations. 16 In the present study, we successfully demonstrated the usefulness of ChatGPT customized by uploading information from drug package inserts, which are the most common and basic sources of drug information, suggesting the potential usefulness of generative AI in the field of pharmacy.
Web searches, such as those provided by Google, require users to extract and synthesize information from the search results based on their own judgment. In contrast, cGPT provides direct judgment results based on its knowledge. Therefore, the users do not need to select or make judgments about information. However, if the knowledge base of cGPT is not up-to-date, web searches may be advantageous for providing more current information. Additionally, cGPT carries the risk of hallucinations, while web searches also have the potential to yield inaccurate information. In either case, the ultimate responsibility for decision-making lies with the user. The cGPT in this study shows a notable advantage in that it can be developed with natural language inputs, without requiring coding skill. However, for applications requiring real-time information retrieval, the use of Retrieval-Augmented Generation (RAG) may be more appropriate. Nevertheless, implementing RAG typically requires a higher level of technical proficiency, including programming skills, and a substantial investment of time and resources.
This study has some limitations. The first is that the results were compared with those obtained using ChatGPT-3.5 and ChatGPT-4o and not with the responses of clinical pharmacists. Therefore, to estimate the potential for cGPTs to take on the role of clinical pharmacists, further studies are needed to compare the performance of cGPTs with that of clinical pharmacists. Additionally, the selection of the 22 OTC drugs and 12 consumer conditions considered in this study was arbitrary, and different results might have been obtained if other drugs and/or consumer conditions had been targeted. However, this study does cover common medications such as caffeine, diphenhydramine, and antipyretic analgesics, which have been reported as major causes of OTC drug overdoses in Japan. 17 Therefore, the results are considered to reflect, to a certain extent, scenarios in which medication consultations were sought using ChatGPT, particularly regarding the proper use of problematic OTC drugs. In the future, generalizability could be improved by understanding the types of drug-related questions that consumers intend to input into generative AI and by building and evaluating cGPT based on those questions.
Conclusion
GPTs customized with information from drug package inserts using appropriate instructions were demonstrated to be a potentially useful drug information tool for consumers and patients, suggesting the potential usefulness of GPTs in the field of pharmacy. The rapid evolution of ChatGPT further suggests its future potential to assist in medication counseling through the application of appropriate knowledge, improved prompt engineering, and analysis of patterns in patients' questions.
Supplemental Material
sj-xlsx-1-dhj-10.1177_20552076251323810 - Supplemental material for Medication counseling for OTC drugs using customized ChatGPT-4: Comparison with ChatGPT-3.5 and ChatGPT-4o
Supplemental material, sj-xlsx-1-dhj-10.1177_20552076251323810 for Medication counseling for OTC drugs using customized ChatGPT-4: Comparison with ChatGPT-3.5 and ChatGPT-4o by Keisuke Kiyomiya, Tohru Aomori and Hisakazu Ohtani in DIGITAL HEALTH
Supplemental Material
sj-docx-2-dhj-10.1177_20552076251323810 - Supplemental material for Medication counseling for OTC drugs using customized ChatGPT-4: Comparison with ChatGPT-3.5 and ChatGPT-4o
Footnotes
Author contributions
Conceptualization: OH. Data curation: KK. Formal analysis: KK and AT. Investigation: KK and AT. Methodology: OH and KK. Writing – original draft: KK. Writing – review & editing: AT and OH. All authors have read and agreed to the published version of the manuscript.
Data availability
The data underlying the results of this article will be shared by the corresponding author upon reasonable request.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical considerations
This article does not contain any data from humans.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
References