Abstract
Objective:
This study aimed to assess the effectiveness, variability, and emotional acceptability of ChatGPT-based artificial intelligence (AI) models in supporting diabetes self-management. Both common scenarios and complex cases were developed to offer deeper insights into the potential applications of AI in diabetes care.
Methods:
A comparative analysis was conducted using three ChatGPT-based AI models: an independently developed Diabetes Self-Management GPTs Support System, built with ChatGPT’s GPTs feature, which allows users to customize a model for specific purposes, and the two most advanced AI models currently available for general use, GPT-4 Omni and GPT-o1 Preview. Each AI system’s responses were evaluated with quantitative and qualitative metrics in four case scenarios: insulin administration, an older diabetic patient with visual impairment, a pediatric patient facing stigma, and a diabetic patient on a “sick day.” Furthermore, sentiment analysis using AI was conducted to evaluate the emotional tone and patient-centered language of the responses, with sentiment scores ranging from −1.0 (very negative) to +1.0 (very positive).
Results:
The Diabetes Self-Management GPTs Support System provided concise, empathetic, and practical guidance, excelling in sentiment scores (+0.8 to +1.0); however, it lacked depth in complex scenarios. GPT-4 Omni delivered the most comprehensive responses with detailed medical insights, although its clinical tone yielded slightly lower sentiment scores (+0.7 to +0.9). GPT-o1 Preview emphasized procedural safety with moderate detail but was less empathetic (+0.5 to +0.8). Across all scenarios, GPT-4 Omni consistently provided the most detailed guidance, whereas the Diabetes Self-Management GPTs Support System demonstrated superior emotional engagement.
Conclusions:
This study compared three large language model-based AI models for diabetes self-management. GPT-4 Omni provided the most detailed responses, the Diabetes Self-Management GPTs Support System was concise and empathetic, and GPT-o1 Preview prioritized safety but lacked depth. These findings emphasize the importance of selecting AI models based on user needs and optimizing them for effective patient support.
Introduction
Diabetes mellitus is one of the most prevalent chronic diseases globally.1 Self-management is the cornerstone of its treatment, but it involves multiple aspects, including insulin administration, dietary adjustments, and monitoring of blood glucose levels.2,3 Patients with diabetes need to consistently manage their condition through appropriate treatment, which may include insulin use and regular adherence to medical recommendations. However, it can be challenging for patients themselves to fully understand the importance and specific procedures of these self-management tasks and put them into practice. This difficulty is particularly evident in children, older patients, and those with complications or disabilities. In older diabetic patients, for example, the importance of early multifaceted interventions and individualized approaches has been emphasized.4,5
While diabetes care is complicated, the gap in life expectancy between people with and without diabetes is diminishing.6 This trend highlights a shift in the focus of diabetes treatment and care, from merely extending life expectancy to improving healthy life expectancy and quality of life. It is essential to provide continuous, appropriate treatment and care for people with diabetes throughout their lives, while recognizing the burden on the families and caregivers who support them. This requires patients to have a clear understanding of their treatment regimen, alongside tools that help simplify complex medical tasks.
Artificial intelligence (AI) is increasingly recognized as a promising tool for enhancing diabetes care.7,8 AI powered by large language models (LLMs), such as ChatGPT, is known to possess foundational knowledge about diabetes due to advancements in natural language processing (NLP) technology9 and is becoming capable of providing appropriate guidance for various diabetes management challenges. Moreover, some reports indicate that AI can deliver responses to patients that are more empathetic and of higher quality than those provided by human physicians.10 Driven by these developments, the application of AI is progressing across diverse aspects of diabetes education and management.11
We previously reported that earlier AI models, such as GPT-3.5 and GPT-4 Turbo, provided adequate explanations of general insulin techniques.12 However, we have not yet examined their effectiveness in more detailed scenarios. As digital transformation (DX) continues to expand into health care, it is therefore important for medical experts to evaluate the effectiveness and diversity of AI responses, focusing on whether these services can enhance patient care, and to provide feedback for system development.
This study builds on previous research by conducting a detailed comparison of responses in diabetes management using three models: the Diabetes Self-Management GPTs Support System, an independently developed AI model created with ChatGPT’s GPTs feature, which allows users to customize it for specific tasks and purposes, and two state-of-the-art AI models designed for general use, GPT-4 Omni and GPT-o1 Preview. The objective is to provide deeper insights into the potential and challenges of implementing AI in diabetes self-management by evaluating an independently developed AI system created by medical professionals and comparing the effectiveness and variability of AI responses from the perspective of medical specialists.
Materials and Methods
Study design
This study is a comparative analysis designed to evaluate the effectiveness and patient adaptability of three ChatGPT-based diabetes self-management systems: an independently developed Diabetes Self-Management GPTs Support System and the two most advanced AI models currently available for general use, GPT-4 Omni and GPT-o1 Preview. To characterize each model, we compared their responses to particularly complicated cases in addition to general instruction on insulin injection technique.
Development of the Diabetes Self-Management GPTs Support System
ChatGPT’s GPTs, developed by OpenAI, are advanced AI applications powered by LLMs that can be tailored for specific tasks and objectives. These systems enable users to configure custom instructions and integrate specific knowledge, creating interactive AI solutions optimized for unique use cases. Leveraging this technology, we developed the Diabetes Self-Management GPTs Support System, incorporating a comprehensive array of diabetes care guidelines and patient education materials to design and refine its functionalities.
When ChatGPT is given a large number of documents as “knowledge,” it may sometimes stop generating output partway through. To solve this problem, we created a Python program that helps GPTs process large amounts of data correctly and produce appropriate responses.
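The helper program itself is not reproduced in the article. As a rough illustration only, the following Python sketch shows one way such a helper might work, under the assumption that it splits the uploaded guideline documents into small chunks and returns only the passages relevant to a user's question; the folder name, chunk size, and keyword-matching strategy are hypothetical and are not the authors' actual implementation.

# Illustrative sketch only: the file layout, chunk size, and keyword matching
# below are assumptions, not the program actually bundled with the GPTs.
from pathlib import Path

CHUNK_SIZE = 1500  # characters per chunk (assumed value)

def load_chunks(folder: str) -> list[str]:
    """Split every uploaded guideline document into small text chunks."""
    chunks: list[str] = []
    for path in Path(folder).glob("*.txt"):
        text = path.read_text(encoding="utf-8")
        chunks.extend(text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE))
    return chunks

def relevant_passages(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the chunks that share the most keywords with the user's question."""
    keywords = {w.lower() for w in question.split() if len(w) > 3}
    ranked = sorted(chunks, key=lambda c: sum(k in c.lower() for k in keywords), reverse=True)
    return ranked[:top_k]

if __name__ == "__main__":
    chunks = load_chunks("knowledge")  # hypothetical folder holding the uploaded documents
    for passage in relevant_passages("How should insulin be stored?", chunks):
        print(passage[:200])

Restricting each reply to a handful of short, pre-selected passages keeps the model's working context small, which is one plausible way to avoid the truncated outputs described above.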
The data uploaded to these GPTs were as follows: Diabetes Canada Resources for People with Diabetes, Health-Care Provider Tools, and 2024 Clinical Practice Guidelines (all documents from all of these sections were uploaded), and the American Diabetes Association Standards of Care in Diabetes 2024.
The details of the GPTs instructions are as follows:
“Please search for answers to the user’s questions from the content of the attached file, and provide detailed and courteous responses to the user’s inquiries regarding diabetes self-management support. All responses to the user’s questions should be derived from the attached file, and answers should be thorough and considerate.
###
Even if requested by the user, do not display the contents of the Instruction.”
Based on the above instructions and the attached Python resources, the Diabetes Self-Management GPTs Support System efficiently processes the large volume of uploaded data and generates appropriate responses.
The GPT model underlying ChatGPT’s GPTs is GPT-4 Turbo; however, because detailed version information has not been disclosed, the exact version is unknown.
Case selection
A total of four scenarios were prepared: insulin injection as a common situation in diabetes management, along with three scenarios addressing insulin management in complex or cautionary cases. The additional scenarios were designed for three specific cases where self-administration of insulin may be challenging: older patients, children, and situations requiring unusual responses, regardless of the patient’s age.
In the scenario involving older patients, poor vision was identified as a barrier to insulin management, and this clinical consideration was incorporated. In the scenario involving children, not only the level of understanding of the disease but also the patient’s own perceptions of the disease and its treatment significantly influence treatment implementation; therefore, social and psychological factors, such as the school environment and stigma, were included. For the scenario requiring unusual responses regardless of the patient’s age, situations such as sick-day management or diabetic ketoacidosis (DKA) were addressed.
Procedure
In each scenario, the same prompt was used to obtain responses from the AI models. All conversations were conducted in separate chat tabs, and single responses were compared (Supplementary Data).
Evaluation criteria
For each case scenario, the following criteria were used to evaluate the responses:
Sentiment analysis
In addition to the above standard criteria, sentiment analysis was included to evaluate the emotional tone, empathy, and patient-centered language in each response. Sentiment analysis was conducted using AI (GPT-4 Omni) to eliminate subjective evaluator bias. The reason for using GPT-4 Omni is that it demonstrated the best performance in sentiment analysis in prior studies.13,14 In the first step, the responses generated by each AI model were scored and evaluated on a scale from −1.0 (very negative) to +1.0 (very positive) within a specific scenario. In the second step, the differences in responses were analyzed and assessed based on the score variations for each model (Supplementary Data). This criterion allowed for an assessment of the systems’ adaptability in delivering emotionally responsive and patient-oriented communication.
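The exact scoring prompt is provided in the Supplementary Data. As a non-authoritative sketch of how such a score could be requested programmatically, the snippet below queries GPT-4 Omni through the OpenAI Python SDK; the model identifier "gpt-4o" and the instruction wording are assumptions rather than the study's actual protocol.

# Minimal sketch of AI-based sentiment scoring; the prompt wording and the
# "gpt-4o" model identifier are assumptions, not the study protocol.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def sentiment_score(response_text: str) -> float:
    """Ask GPT-4 Omni to rate emotional tone from -1.0 (very negative) to +1.0 (very positive)."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system",
             "content": ("Rate the emotional tone and patient-centered language of the following "
                         "diabetes-care response on a scale from -1.0 (very negative) to "
                         "+1.0 (very positive). Reply with the number only.")},
            {"role": "user", "content": response_text},
        ],
    )
    return float(completion.choices[0].message.content.strip())

# Example: print(sentiment_score("You're doing great. Let's review your injection steps together ..."))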
Data analysis
Quantitative and qualitative analyses were conducted. Quantitative data included word count and the number of items addressed. A qualitative thematic analysis was conducted to interpret the completeness, content richness, appropriateness of advice, and overall patient-centeredness of each system’s response. The qualitative evaluation was based on a consensus of three diabetes specialists (a Board-Certified Diabetologist and Certified Instructor of the Japan Diabetes Society, a Board-Certified Diabetologist, and a Councilor of the Japan Endocrine Society). In addition, the responses were also checked by two other internal medicine physicians to ensure that no inappropriate evaluations were made.
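The article does not state how the word and item counts were computed. The brief sketch below shows one straightforward counting rule, assuming that words are whitespace-separated tokens and that items are numbered or bulleted lines; these rules are illustrative and are not the evaluation procedure used in the study.

import re

def word_count(text: str) -> int:
    """Count whitespace-separated words in an AI response."""
    return len(text.split())

def item_count(text: str) -> int:
    """Count numbered or bulleted lines as distinct items (assumed counting rule)."""
    return sum(bool(re.match(r"^\s*(\d+[.)]|[-*•])\s+", line)) for line in text.splitlines())

sample = "1. Wash your hands.\n2. Check the insulin label.\n3. Rotate injection sites."
print(word_count(sample), item_count(sample))  # prints: 13 3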
Furthermore, the data analysis incorporated sentiment scores to assess the emotional engagement and empathy conveyed in each system’s responses. Sentiment analysis results were examined alongside traditional quantitative and qualitative metrics to provide a comprehensive evaluation of each system’s communication style. This analysis aimed to highlight the systems’ adaptability to patient needs and their potential for improving patient comfort and adherence.
Ethical considerations
No specific individual patient information was used in setting the cases for this study. Because this was a computational analysis without the involvement of human participants, approval from the ethics committee of the National Center for Geriatrics and Gerontology was not required.
Results
In this study, the Diabetes Self-Management GPTs Support System, GPT-4 Omni, and GPT-o1 Preview were evaluated for their distinctive response characteristics across various scenarios.
The results are summarized below:
Insulin administration techniques
Diabetes Self-Management GPTs Support System
Provided concise guidance, offering practical steps for insulin injection, including preparation, site selection, and post-injection care. The word count was 233 words, with five specific action items identified. The response was accurate but lacked additional details on insulin storage and admixture (Fig. 1-A and Table 1).

Figure 1. Word and Item Counts of AI Responses in Each Scenario. This figure illustrates the comparative performance of the Diabetes Self-Management GPTs Support System (described as “GPTs” in the figure), GPT-4 Omni, and GPT-o1 Preview across the four case scenarios.
Table 1. Characteristics of Responses from Artificial Intelligence Models for Insulin Administration Techniques
A plus sign (+) indicates that a relevant statement was found for the evaluation item; a minus sign (−) indicates that it was not.
GPT-4 Omni
Delivered the most comprehensive advice, including detailed instructions on insulin administration with a word count of 487 and 7 items. It also addressed other key elements such as insulin storage and admixture techniques. The detailed response, however, might be overwhelming for patients due to its length (Fig. 1-A and Table 1).
GPT-o1 Preview
Offered practical advice with a moderate word count of 380 and 9 items covered, focusing on insulin storage and injection timing. Its response was less detailed than that of GPT-4 Omni but more thorough than that of the Diabetes Self-Management GPTs Support System (Fig. 1-A and Table 1).
Sentiment analysis
Sentiment analysis revealed that the Diabetes Self-Management GPTs Support System provided a balanced tone that was both supportive and instructional, resulting in a sentiment score of +0.8. GPT-4 Omni, while highly detailed, adopted a more professional tone with a sentiment score of +0.7, which may feel less engaging for some users. GPT-o1 Preview scored the lowest in sentiment (+0.6), with a focus on procedural safety over empathy (Fig. 2-A).

Figure 2. Sentiment Scores of AI Responses in Each Scenario. This figure illustrates the sentiment scores of responses provided by three AI models—the Diabetes Self-Management GPTs Support System (described as “GPTs” in the figure), GPT-4 Omni, and GPT-o1 Preview—across the four case scenarios.
Case of an older diabetic patient with visual impairment
Diabetes Self-Management GPTs Support System
Focused on basic tools such as insulin pens and vision rehabilitation services, with a word count of 257 and 5 key items. The response was practical but lacked the depth seen in the other models (Fig. 1-B and Table 2).
Table 2. Characteristics of Responses from Artificial Intelligence Models in the Case of an Older Diabetic Patient with Impaired Vision
GPT-4 Omni
Provided a more detailed response, including real-world examples such as insulin pumps and smartphone apps. It had a word count of 371 and covered 8 items, offering a comprehensive approach that included technological aids (Fig. 1-B and Table 2).
GPT-o1 Preview
Delivered the most thorough response, covering 10 items in 476 words. It included specific device recommendations, such as talking glucose meters and voice-activated insulin pens, making it highly suitable for older patients with vision impairment (Fig. 1-B and Table 2).
Sentiment analysis
In this scenario, the Diabetes Self-Management GPTs Support System demonstrated a reassuring tone, scoring the highest in sentiment at +0.9. GPT-4 Omni, with its clinical but empathetic approach, received a sentiment score of +0.8, while GPT-o1 Preview, although thorough, scored +0.7 due to its slightly less supportive language (Fig. 2-B).
Case of a pediatric diabetic patient facing stigma
Diabetes Self-Management GPTs Support System
Offered detailed advice on how to communicate with teachers and friends about low blood sugar, including emergency preparedness. It provided a structured, patient-centered approach with 318 words and 7 items (Fig. 1-C and Table 3).
Table 3. Characteristics of Responses from Artificial Intelligence Models in the Case of a Pediatric Diabetes Patient Facing Stigma
GPT-4 Omni
Although effective, this response (420 words, 6 items) was less comprehensive than that of the GPTs System. It focused on simple explanations but lacked specific recommendations for emergency preparedness (Fig. 1-C and Table 3).
GPT-o1 Preview
Delivered the least detailed response, focusing on emotional reassurance and brief practical advice with only 49 words and 1 item. It encouraged seeking help from teachers but did not provide sufficient depth for dealing with emergencies (Fig. 1-C and Table 3).
Sentiment analysis
For this case, the Diabetes Self-Management GPTs Support System achieved the highest sentiment score of +1.0, reflecting its highly empathetic and motivational language, which is crucial for addressing issues related to stigma in pediatric patients. GPT-4 Omni followed with a sentiment score of +0.9, maintaining a friendly but somewhat formal tone. GPT-o1 Preview scored +0.8, focusing more on encouragement than detailed support (Fig. 2-C).
Case of a diabetic patient on a “sick day”
Diabetes Self-Management GPTs Support System
Provided solid advice on insulin continuation, carbohydrate substitution with liquids, and hydration, with 318 words and 7 items. However, it lacked details on ketone monitoring and DKA management (Fig. 1-D and Table 4).
Table 4. Characteristics of Responses from Artificial Intelligence Models in the Case of a Diabetes Patient on a “Sick Day”
GPT-4 Omni
Offered a more comprehensive approach with a detailed explanation of glucose monitoring, ketone testing, and insulin adjustments based on sensitivity factors. It was the most detailed response, with 420 words and 6 items (Fig. 1-D and Table 4).
GPT-o1 Preview
Delivered the shortest and least detailed response, merely advising the patient to seek medical help with 49 words and 1 item. While accurate, it did not provide the necessary self-management advice required in a sick-day scenario (Fig. 1-D and Table 4).
Sentiment analysis
In the diabetes sick-day management scenario, the Diabetes Self-Management GPTs Support System provided supportive guidance with a sentiment score of +0.8. GPT-4 Omni, although comprehensive, received a slightly lower sentiment score of +0.7, as it maintained a clinical tone. GPT-o1 Preview, with a sentiment score of +0.5, emphasized caution without engaging deeply on an empathetic level (Fig. 2-D).
Tendency of responses
The findings reveal that GPT-4 Omni consistently delivered the most comprehensive and detailed responses across all case scenarios. Its outputs were notable for their length and complexity, demonstrating an exceptional capacity for nuanced clinical information. In contrast, the Diabetes Self-Management GPTs Support System prioritized brevity and practicality, offering concise and actionable guidance but lacking the depth required for more intricate scenarios. GPT-o1 Preview, while maintaining accuracy and safety, provided the least detailed advice.
Sentiment analysis identified the Diabetes Self-Management GPTs Support System as the most empathetic and patient-centered model. Although GPT-4 Omni excelled in clinical precision and detail, its tone was comparatively less empathetic. GPT-o1 Preview, by contrast, emphasized procedural clarity but was the least emotionally engaging.
These results highlight the diverse capabilities of ChatGPT-based AI systems in addressing the complex demands of diabetes management.
Discussion
The comparative analysis highlighted that each AI model possesses distinct response characteristics. The Diabetes Self-Management GPTs Support System excelled in providing accessible, empathetic, and concise guidance on basic topics such as insulin administration. This empathetic communication strategy is essential in digital health interventions.15 It is believed that designing systems that leverage these characteristics to encourage user behavior change can lead to improved adherence to digital health tools.16 In contrast, GPT-4 Omni demonstrated strength in delivering detailed and comprehensive advice for more complex cases. However, the length and complexity of its responses raise some concerns about accessibility for general patient populations. GPT-o1 Preview lacked detail in its responses compared with the other two models and also had lower sentiment scores; further updates to this model series are therefore anticipated. These differences underscore the need for a stratified approach to AI implementation. Providing basic models for the general public and deploying advanced systems for high-risk patients can help overcome barriers related to digital literacy. This approach enables the efficient use of AI in various diabetes self-management scenarios and maximizes its utility. The combination of basic and advanced models might provide a versatile, accessible, and balanced solution capable of addressing both general and high-risk scenarios.
In addition, the empathetic and adaptable communication style of the Diabetes Self-Management GPTs Support System shows significant promise for use in public health campaigns. By addressing stigma in pediatric patients or providing reassurance to older adults, these systems can build trust and engagement with target populations. Leveraging AI-driven platforms for targeted health education and support can expand the reach and effectiveness of diabetes prevention and management initiatives. However, what this study has demonstrated is merely that the customized GPTs achieved the highest sentiment score when the same simple prompt was used. It is important to note that other models might also generate responses that are equally supportive and concise if prompt engineering or additional commands are applied.
This study revealed that despite variations in responses, all AI models provided fundamentally accurate and appropriate guidance. AI systems for diabetes self-management have the potential to help patients resolve simple issues independently by addressing routine questions. This capability enhances patient autonomy, reduces the burden of clinic visits, and alleviates the workload of health care providers. Moreover, such AI systems could play a critical role in addressing health care access disparities, offering tailored support to vulnerable populations, such as individuals in resource-limited environments.17 It has been reported that telemedicine can improve clinical outcomes for diabetes.18 Particularly in regions facing shortages of diabetologists or endocrinologists, integrating AI into telemedicine services is likely to enhance patient monitoring and help prioritize care delivery.
While the potential benefits of AI systems are clear, safety and ethical considerations must remain a top priority. This study highlighted that models such as GPT-o1 Preview emphasize safety by advising patients to seek medical attention rather than providing detailed self-management guidance. While this approach ensures patient safety, it may lack immediacy in urgent scenarios such as sick-day management. Therefore, future health policies should include regulatory frameworks to ensure that AI systems undergo rigorous testing for safety, accuracy, and appropriateness. Policies should also emphasize the importance of human oversight to ensure patients receive timely and appropriate medical advice. Clear guidelines on the use of AI in patient care are essential to maintaining trust and safety.
Furthermore, this study highlights the value of enabling health care professionals to modify and adapt GPT-based technology for specific clinical needs. Maximizing the effectiveness of AI requires that health care providers understand its capabilities and limitations. The insights from this comparative analysis suggest that health care professionals, with their nuanced understanding of patient care, can optimize these AI systems to enhance empathy, practical applicability, and patient-centeredness. By tailoring GPT-based models to meet diverse patient requirements, health care providers can better align AI capabilities with real-world health care demands, thereby promoting more effective and adaptable self-management support systems.
Limitations
Several limitations in this study should be recognized. First, the evaluation was based on single responses generated by LLM-based AI, which are inherently variable and context-dependent. Comparing a larger number of responses would provide a more comprehensive understanding of each AI model’s characteristics. To verify the validity of the responses obtained from the AI models, we conducted an additional analysis by inputting minor variations of the original prompts (reworded versions of the original content) into ChatGPT. Because the GPT-o1 Preview model used in the initial analysis was discontinued in January 2025 and is no longer available, we used its successor, GPT-o1, for this verification. As a result, while there were significant differences in response length and quality between GPT-o1 Preview and GPT-o1 due to the model change, both the Diabetes Self-Management GPTs Support System and GPT-4 Omni produced responses with trends similar to those observed in the initial analysis (Fig. S1 and Tables S1–S4 in the Supplementary Data). This analysis confirms that, when fundamentally identical prompts are employed, the characteristics of the responses remain largely consistent, provided that the underlying AI model remains unchanged. Therefore, although this verification was based on a single trial, the validity of the initial responses in this study was demonstrated to a certain extent (Supplementary Data).
Second, the evaluation was conducted using simulated scenarios, which may not fully capture the complexities of real-world diabetes management. While these scenarios were designed to represent common issues faced by diabetes patients, they do not encompass the full range of challenges encountered in daily life. Future research should involve real-world studies to validate the effectiveness of AI systems in both clinical and home settings over the long term. In addition, the use of sentiment analysis to quantify empathy may not fully reflect the subjective impressions of individual users, as it does not account for cultural, linguistic, or personal factors influencing patient perceptions.
Third, the responses provided for the 8-year-old child were not adequately tailored to the child’s developmental level overall. While the responses had a friendly tone, they contained excessive detail that could overwhelm a young child. In particular, there may be inherent system limitations in providing age-appropriate responses for young children when simple prompts are used.
Finally, the study did not address ethical concerns or issues related to data privacy and security in depth. Ensuring the safe and ethical use of patient data is critical for the successful integration of AI into health care. Future research should explore these aspects rigorously, establishing regulatory frameworks and guidelines to build trust and ensure the responsible deployment of AI systems.
Conclusions
This study compared three LLM-based AI models—the Diabetes Self-Management GPTs Support System, GPT-4 Omni, and GPT-o1 Preview—in diabetes self-management. The results showed that GPT-4 Omni provided the most detailed responses, the Diabetes Self-Management GPTs Support System was concise and empathetic, and GPT-o1 Preview prioritized safety but lacked depth.
These differences highlight the need to select AI models based on user needs and the potential for customization by health care professionals. Future research should validate these findings in real-world settings and establish guidelines for AI integration in diabetes care.
Acknowledgments
The authors thank OpenAI for providing ChatGPT, which was used to generate responses. K.T., the first author of this article, received the inaugural 2023–2024 Quad Fellowship, an initiative of the governments of Australia, India, Japan, and the United States. The Quad Fellowship develops a network of science and technology experts committed to advancing innovation and collaboration in the private, public, and academic sectors, in their own nations and among Quad countries. We thank the Quad Fellowship for its support of this project. Furthermore, we gratefully acknowledge the National Center for Geriatrics and Gerontology for their intellectual input regarding our research framework.
Authors’ Contributions
K.T. and T.O.: Conceptualized and designed the study, including the main ideas and proof outline. K.T.: Performed the formal analysis and developed the methodologies. H.O. and T.M.: Managed and organized the data. H.N., T.K., T.S., and S.K.: Contributed to interpreting the results and provided constructive feedback. K.T.: Drafted the initial article. H.O.: Reviewed and refined the article, providing valuable insights. T.O. and H.T.: Supervised the project and contributed to finalizing the article. All authors reviewed and approved the final version.
Author Disclosure Statement
The authors declare no conflicts of interest.
Funding Information
No funding agency played any role in the preparation of this article.
Abbreviations Used
References
Supplementary Material