Abstract
Introduction
Large language models (LLMs) are increasingly being used by patients for health information, yet their reliability in orthodontics remains uncertain. This study aims to evaluate the accuracy, reliability, quality, and readability of orthodontic retention information generated by ChatGPT 3.5, ChatGPT 4, Gemini, and Copilot.
Materials and Methods
Twenty-three frequently asked questions about orthodontic retainers were collected and categorised into general retainer questions (n=8), fixed retainer questions (n=5), and removable retainer questions (n=10). Questions were entered into each AI model once under standardised conditions. Responses were anonymised and independently assessed by two consultant orthodontists. Accuracy was scored using a five-point Likert scale, reliability with the modified DISCERN tool, quality with the Global Quality Scale (GQS), and readability with the Flesch Reading Ease Score (FRES). Statistical analysis included ANOVA, Kruskal-Wallis, post-hoc tests, and intraclass correlation coefficients (ICC).
Results
Evaluator agreement was excellent across all domains (ICC 0.821-0.957). ChatGPT 3.5 achieved the highest accuracy (mean 4.49), while ChatGPT 4 and Copilot scored highest in reliability (means 30.47 and 30.11). ChatGPT models outperformed Gemini and Copilot in quality, with over 75% of their responses rated good to excellent. Readability was low across all models; however, Copilot produced relatively more readable text (mean FRES score of 53.93).
Limitations
This study is limited by its focus on single-turn responses, which may not reflect the iterative interactions typical of real patient-AI conversations. In addition, the evolving nature of AI models may affect reproducibility, and the restriction to English may limit generalizability across languages.
Conclusion
All AI models demonstrated moderate competence in providing orthodontic retention information, but their reliability was inconsistent and readability was poor. They therefore require human oversight and methodological refinement and should not serve as replacements for professional advice.
Introduction
Artificial intelligence (AI) refers to computer systems designed to perform tasks that typically require human intelligence, such as problem solving, reasoning, and language use. 1 Within AI, machine learning (ML) enables these systems to improve performance by identifying patterns in large datasets and adjusting outputs accordingly. A further development is the creation of large language models (LLMs), which are trained on massive collections of text using deep learning techniques and natural language processing methods.
ChatGPT, developed by OpenAI Inc. (San Francisco, CA, USA), is based on large transformer engines and trained on extensive quantities of text. Version 3.5 is freely available, while version 4 offers enhanced reasoning and contextual abilities through the subscription-based ChatGPT Plus. Gemini, introduced by Google DeepMind (Google Ireland Limited, Dublin, Ireland), integrates advanced language understanding with multimodal capabilities, designed to provide versatile responses across various input types, including text. 2 Copilot, created by Microsoft (Microsoft Corporation, Redmond, WA, USA), is integrated into its productivity ecosystem and combines OpenAI’s GPT models with web search functions to generate contextually informed answers. 3 All these models are capable of generating human-like responses to natural language prompts by predicting the most probable sequence of words. However, despite their shared reliance on LLM technology, these platforms differ in terms of training data, accessibility, and intended use cases, which may impact the accuracy and reliability of the information they provide in specialised fields.4,5
In healthcare, including dentistry, AI tools are increasingly being used to provide patients with accessible, on-demand information. 6 Recent studies have evaluated the performance of large language models in several orthodontic subfields, including clear aligner therapy, impacted canine diagnosis, crossbite correction and orthodontic consultation queries. These investigations demonstrate that while AI systems can provide generally accurate information, their reliability, citation behaviour, and readability vary significantly across models and clinical topics.4,5 In orthodontics, one of the most critical phases of treatment is retention, which aims to maintain the corrected tooth alignment achieved after active orthodontic procedures, such as fixed appliances or clear aligners. Proper retention is essential to prevent relapse, which could undermine the results of months or even years of orthodontic treatment. 7 However, despite the clinical importance of orthodontic retention in maintaining treatment outcomes, the reliability of AI-generated information specifically addressing retainer use and maintenance has not been systematically evaluated.
As more patients turn to AI-based platforms for health-related information, the accuracy and readability of the content provided become crucial. While AI offers advantages in terms of accessibility and convenience, there are growing concerns about the potential for misinformation, especially in specialized fields such as orthodontics. 5 Studies have shown that AI tools like ChatGPT can produce seemingly accurate responses; however, these may sometimes lack depth, contain inaccuracies, or be written in a manner that is too complex for patients to understand. Given the complexity of orthodontic retention, the quality of information provided by AI systems must be carefully evaluated to ensure that patients are receiving reliable advice that aligns with clinical best practices. 8
Furthermore, orthodontic retention requires a precise understanding of when, how, and for how long retention appliances should be used to maintain the desired outcomes of orthodontic treatment. Incorrect or misleading information in this area could lead to improper use of retainers, ultimately affecting patient satisfaction and treatment success. 9 Therefore, understanding whether AI models such as ChatGPT can reliably and effectively communicate retention-related information is critical for their potential role as adjunctive tools for both patients and orthodontic professionals.
To address these concerns, this study aims to evaluate the reliability of orthodontic retention information produced by different AI-based programs, namely ChatGPT 3.5, ChatGPT 4, Gemini, and Copilot, in terms of accuracy, relevance, and readability, as assessed by consultant orthodontists.
Materials and Methods
Final list of questions entered into AI models.
The AI models evaluated in this study included ChatGPT (GPT 3.5, free version) and ChatGPT (GPT 4, available through a ChatGPT Plus subscription), as well as Gemini and Copilot. Previous research indicates that ChatGPT may produce varying and sometimes faster responses when the same question is asked multiple times or at different intervals. 10 To maintain consistency, each question in this study was submitted only once. Gemini, which generates three response drafts for each query, 11 was evaluated using only the first draft in this study. The first draft is the default response automatically displayed by Gemini and in real-world clinical scenarios, patients typically read and rely on this first response and rarely explore alternative drafts. 2 Furthermore, using only the first draft ensures methodological consistency with the single-response approach applied to ChatGPT-3.5, ChatGPT-4, and Copilot, enabling fair comparison across all evaluated AI models. As for Copilot, the “more balanced” standard mode was selected among its three available communication styles. To avoid response bias, a separate user account was created for each AI model, and a new chat window was opened for every question. All questions were asked on the same day using the same laptop (MacBook Air 13, 1.6 GHz Dual-Core Intel® Core™ i5, 8 GB 1600 MHz DDR3, Intel HD Graphics 6000 1536 MB) and the same fixed fibre internet connection. All questions were entered using standardized, concise declarative phrasing to ensure consistency across models and minimize potential variability introduced by prompt wording. The responses from ChatGPT 3.5, ChatGPT 4, Gemini, and Copilot were then organised into four separate forms (A, B, C, and D), with all identifiers of the AI models removed to ensure evaluator blinding. Two experienced consultant orthodontists independently reviewed the responses, assessing them in reference to current literature and clinical practice.
Accuracy was evaluated using a five-point Likert scale.12,13 On this scale, a score of 1 indicated that the AI’s response was completely incorrect, while a score of 2 reflected answers containing more incorrect information than correct. A score of 3 represented an equal balance of correct and incorrect information, a score of 4 indicated that the response contained more correct information than incorrect, and a score of 5 denoted a completely correct answer.
The DISCERN tool was developed to help both patients and healthcare providers evaluate the quality of health-related information. 14 It consists of 16 questions, each scored on a scale from 1 (low quality) to 5 (high quality). The first eight questions measure reliability, the next seven assess treatment options, and the final question evaluates the overall quality based on the previous responses. For the purpose of this study, only the first eight questions, which concern the reliability of the information, were used in the assessment form (the modified DISCERN scale). These questions assess: (i) clarity of objectives, (ii) achievement of objectives, (iii) relevance, (iv) identification of information sources, (v) timing of the information provided, (vi) balance and impartiality, (vii) inclusion of additional support or resources, and (viii) acknowledgment of areas of uncertainty. Each question was scored as 1 for “no,” 2-4 for “partial,” and 5 for “yes,” and the eight scores were summed. The total score was then categorised as poor (8-15 points), moderate (16-31 points), or good (32-40 points). 15
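The scoring and banding of the modified DISCERN scale described above can be sketched in a few lines of Python. This is an illustration only, not the authors' scoring sheet: the function name `classify_modified_discern` and the example item scores are hypothetical, while the cut-offs (8-15, 16-31, 32-40) are taken from the text.

```python
def classify_modified_discern(item_scores):
    """Sum eight reliability item scores (1-5 each) and band the total.

    Hypothetical helper; the bands follow the cut-offs stated in the text:
    poor (8-15), moderate (16-31), good (32-40).
    """
    if len(item_scores) != 8:
        raise ValueError("the modified DISCERN scale uses exactly 8 items")
    if any(not 1 <= score <= 5 for score in item_scores):
        raise ValueError("each item score must lie between 1 and 5")

    total = sum(item_scores)
    if total <= 15:
        band = "poor"
    elif total <= 31:
        band = "moderate"
    else:
        band = "good"
    return total, band


# Made-up item scores for one response (not study data).
example_scores = [5, 4, 5, 1, 3, 4, 2, 3]
print(classify_modified_discern(example_scores))
```

Because every item contributes at least 1 point, the lowest possible total is 8, which is why the "poor" band starts at 8 rather than 0.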
Additionally, the GQS, a five-question scale, was used to assess the usefulness and quality of the information provided for patients. 16 Scores were calculated by summing the points for each section. Responses with a GQS score of 3 or below were considered to be of low or moderate quality, while those with a score above 3 were classified as good to excellent quality. These tools provided a structured framework to assess the clarity, accuracy, and relevance of the information generated by the AI models.
Readability was assessed using the Flesch Reading Ease Score (FRES).15,17 The readability of the chatbot responses was calculated with the Microsoft Word for Mac FRES tool (version 16.89.1 [24091630]; Microsoft®). The formula applied was: 206.835 - 1.015 × (total words ÷ total sentences) - 84.6 × (total syllables ÷ total words). This calculation typically produces a score between 0 and 100, where lower scores (0-59) indicate difficult readability, scores between 60 and 69 are considered standard readability level, and higher scores (70-100) indicate easier readability.
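As a worked illustration of the formula above, the following Python sketch computes the FRES from pre-counted words, sentences, and syllables and maps it to the three bands used in this study. The function names and the example counts are hypothetical. Word, sentence, and syllable counting is deliberately left outside the sketch, since the counting rules used internally by Microsoft Word are not described here. Treating scores below 60 as "difficult" and below 70 as "standard" is an assumption made to handle non-integer scores.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch Reading Ease Score from pre-computed text counts."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def readability_band(score):
    """Band a FRES value (assumed cut-offs: <60 difficult, <70 standard)."""
    if score < 60:
        return "difficult"
    if score < 70:
        return "standard"
    return "easy"


# Made-up counts: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(100, 5, 150)
print(round(score, 3), readability_band(score))
```

With these made-up counts the text averages 20 words per sentence and 1.5 syllables per word, giving 206.835 - 20.3 - 126.9 = 59.635, just inside the "difficult" band. Longer sentences or more polysyllabic words lower the score further.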
Data analysis
Quantitative data were summarised using the mean, standard deviation (SD), median, and interquartile range. The Shapiro-Wilk test was applied to assess the normality of data distribution. For variables following a normal distribution, one-way analysis of variance (ANOVA) with post-hoc Tukey’s multiple comparison test was performed to evaluate differences between groups. For non-normally distributed variables, the Kruskal-Wallis test with post-hoc Dunn’s multiple comparison test was used. Interobserver agreement was assessed using the intraclass correlation coefficient (ICC), with a significance level of p < 0.05 applied to all analyses.
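For readers less familiar with the one-way ANOVA mentioned above, the sketch below computes the F statistic from first principles in plain Python. It is not the authors' analysis script: the function name `one_way_anova_f` and the three small groups of scores are invented for illustration, and the p-value step is omitted because it requires the F distribution, which a statistics package (for example `scipy.stats.f_oneway`) would normally supply.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more scores than groups")

    group_means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual scores around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Invented scores for three groups (not study data).
print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
```

In this toy example the between-group and within-group sums of squares are both 6, with 2 and 6 degrees of freedom respectively, so F = (6/2) / (6/6) = 3.0. A larger F indicates that the group means differ more than the spread within groups would suggest.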
Results
The intraclass correlation coefficient of the evaluators' data for the Likert scale, modified DISCERN, and Global Quality Scale.
Descriptive statistics of the Likert, modified DISCERN, GQS and FRES scores compared between the four AI groups using the one-way ANOVA test.
*- significant.
**- highly significant.

Comparison of the mean accuracy levels of the four AI models.
The modified DISCERN scores showed statistically significant differences among the four AI models (p = 0.003). ChatGPT 4 achieved the highest mean score (30.47 ± 8.69), followed closely by Copilot (30.11 ± 9.03) and Gemini (29.25 ± 9.90), while ChatGPT 3.5 recorded the lowest score (26.22 ± 5.11) (Table 3 and Figure 2). When categorised, 54.1% of the responses across all models were rated as moderately reliable, 37.7% were classified as good, and 8.2% were considered poor, with the lower scores more frequently observed in outputs from Gemini and Copilot (Table 4).
Comparison of the mean DISCERN scores of the four AI models.
Score distribution of the AI models’ responses according to the modified DISCERN scale and global quality scale classification. Categorical variables (number of questions) are shown as n (%) in the table.
The GQS analysis revealed significant differences among the four AI models (p < 0.001). ChatGPT 3.5 and ChatGPT 4 produced the highest quality scores, with more than 75% of their responses rated as good or excellent (Figure 3). Gemini and Copilot generated a higher proportion of responses in the moderate to lower quality range, though they still produced some outputs rated as good quality. Overall, the responses across all models were generally within an acceptable to high-quality range, with notable variability in the consistency and depth of information provided (Table 4).
Comparison of the mean GQS scores of the four AI models.
The readability analysis, using the Flesch Reading Ease Score (FRES), showed that all AI-generated responses fell within the “difficult to read” range, with scores below 60 across all models (Figure 4), indicating that the generated content would likely require a reading level equivalent to late secondary education. This suggests that AI-generated orthodontic information may remain challenging for many patients to understand without further simplification. Copilot produced the highest readability score (53.93 ± 5.47), followed by ChatGPT 3.5 (49.24 ± 6.81), while Gemini (47.50 ± 11.21) and ChatGPT 4 (47.24 ± 6.00) had the lowest scores. These differences were statistically significant (p < 0.001), indicating notable variation in the reading complexity of the content generated by the different models (Table 3).
Comparison of the mean readability levels of the four AI models.
Post-hoc pairwise comparison of scores in AI models. Tukey’s post hoc test was used for Modified DISCERN and Flesch Reading Ease Score, and Dunn’s post hoc test was used for GQS.
*- significant.
**- highly significant.
Discussion
This study critically evaluated the accuracy, reliability, quality, and readability of orthodontic retention-related information generated by four leading artificial intelligence models using validated assessment tools and blinded expert evaluation. The assessment tools used, including the modified DISCERN instrument, GQS, and the Likert accuracy scale, have been widely applied in previous studies evaluating the quality and reliability of online and AI-generated health information, including in dental and orthodontic contexts (Supplementary Table S1). These tools provide structured and validated approaches for assessing information accuracy, reliability, and clinical usefulness.4,6,8,12,16,17 Additionally, the findings revealed consistently high interobserver agreement across all domains, confirming the robustness of the evaluation methodology and minimising the risk of subjective bias influencing the results. Previous studies have shown that expert raters tend to reach consistent evaluations despite model variability, especially with the use of validated instruments.4,5,18
Among the models, ChatGPT-3.5 demonstrated the highest accuracy scores, indicating a relatively superior ability to generate factually correct content. Conversely, ChatGPT-4 and Copilot outperformed other models in reliability, as reflected by higher modified DISCERN scores, suggesting a greater degree of coherence, relevance, and balanced information delivery. This apparent contradiction may reflect differences in model training priorities and output optimization. ChatGPT-3.5 appears to generate responses that emphasize factual correctness and direct answer generation, potentially reflecting training processes that prioritize predictive accuracy over structured communication. However, these responses may lack key elements required for high reliability scores, including explicit referencing, balanced discussion of treatment options, and acknowledgment of uncertainty. In contrast, models such as ChatGPT-4 and Copilot may produce more structured and context-aware outputs, which align more closely with the criteria assessed by DISCERN, even if their factual accuracy is slightly lower.19,20 This interpretation is consistent with previous literature, which has demonstrated that large language models often exhibit asymmetric performance across evaluation domains, where accuracy, transparency, and information quality do not necessarily improve simultaneously.4,5
In terms of quality, as measured by the GQS, both ChatGPT models consistently produced higher quality responses compared to Gemini and Copilot, although variability in depth and clinical relevance was evident across all platforms. The results of this study indicate that no single model achieved dominance across all performance domains. This multidimensional profile is broadly consonant with recent orthodontic and medical literature, which shows that LLMs can deliver generally accurate and moderately reliable answers, but often with variable depth or citation practices.4,5,8,21,22 Dursun and Bilici Geçer (2024) reported that, for clear aligner questions, all four models (ChatGPT 3.5/4, Gemini, Copilot) produced generally accurate, moderate to good quality outputs, and models differed in reliability and citation behaviour with Copilot often scoring higher on DISCERN and providing more references while also noting variability among ChatGPT versions in other domains. 4 A comparable study found heterogeneous performance across ChatGPT 3.5/4, Bard/Gemini, and Bing/Copilot on evidence-based orthodontic questions, reinforcing that relative strengths can shift by metric and question type rather than by model brand alone. 5 A study measuring the accuracy and reliability of ChatGPT 4 in providing information regarding impacted canines, interceptive orthodontics, and orthognathic surgery has concluded that responses can be “generally good” yet inconsistent and not a substitute for clinical judgment, mirroring the findings of the current study that higher modified DISCERN scores do not uniformly indicate high patient facing quality or accessibility. 8 However, some studies have reported results that diverge from the findings of the present investigation. 
For instance, studies have reported that ChatGPT-4 showed notable improvements in accuracy over earlier versions, suggesting that frequent updates and retraining can lead to rapidly enhanced performance in specific contexts.17,23 Likewise, Miller et al. (2025) demonstrated that large language models can provide neuroimaging decision support at a level comparable to that of human experts, highlighting that outcomes may depend heavily on the clinical field, the complexity of the questions, and the type of knowledge required. 22 Nasra et al. (2025) also reported more favourable outcomes for AI in generating patient educational materials, which may be due to differences in methodology, evaluation tools, and end user expectations resulting in variability across studies. 24
Readability analysis, measured by the Flesch Reading Ease Score, indicated that Copilot generated the most accessible content among the models, though all tools produced text that would be considered difficult for the average patient to understand, highlighting a significant barrier to effective patient communication. Post-hoc analysis underscored these distinctions, and Copilot’s readability advantage did not compensate for its comparatively lower quality scores. Readability is a persistent limitation across investigations, reinforcing our finding that texts typically fall into the “difficult” range on FRES. The limited readability observed across all models may be explained by several linguistic features of AI-generated medical text, including the frequent use of professional terminology, long sentence structures, and complex explanatory phrasing. These characteristics increase the number of syllables per word and words per sentence, both of which directly reduce FRES scores. Previous literature has documented difficult readability across various AI models despite acceptable correctness and quality.4,17 Although some studies noted that some AI models showed promising readability improvements and model-specific advantages in readability, 24 the collective evidence indicates that most outputs remain above the literacy level recommended for lay patient education, highlighting the need for deliberate simplification and clinician-mediated modifications before clinical distribution.
From a critical perspective, these results highlight both the promise and the current limitations of generative AI in orthodontic patient education. While the models exhibit clear potential as adjunctive tools to support communication and information dissemination, their outputs remain inconsistent and, at times, insufficiently tailored to patient comprehension levels. To integrate these tools safely, orthodontic clinicians should treat AI responses as drafts, modifying them for accuracy, rewriting in plain language, and adding reliable references. 25 Clinics can also leverage these tools to produce multilingual or pictorial materials for patients with different literacy levels, making education more inclusive. Clear disclaimers, regular quality checks, and ongoing clinical involvement are key to ensuring these tools are used responsibly in clinical settings.
The limited readability observed across all models can be attributed to several underlying linguistic features of AI-generated medical text. First, the frequent use of domain-specific terminology, such as orthodontic and clinical terms, increases lexical complexity and may reduce accessibility for lay users. Second, AI-generated responses often contain longer sentence structures, which increase the average number of words per sentence—a key determinant in readability formulas such as the Flesch Reading Ease Score (FRES). Third, the logical structuring of responses, which frequently involves multi-step explanations and embedded clauses, contributes to increased cognitive load and reduced readability. These findings are consistent with previous literature, which has shown that large language models tend to produce medically accurate yet linguistically complex outputs that exceed recommended readability levels for patient education.15,24 Similar studies in orthodontics and broader healthcare domains have reported that AI-generated information often falls within the “difficult” readability range despite acceptable accuracy and quality, highlighting a persistent gap between clinical correctness and patient comprehension. 4 Collectively, these findings suggest that improving readability in AI-generated health information will require targeted optimization strategies, including simplification of terminology, reduction of sentence length, and restructuring of explanations into more accessible formats.
When interpreting the results of this study, it is important to consider the limitations. This study evaluated AI-generated responses using a single-turn interaction design, which may not fully reflect real-world usage where patients typically engage in iterative, multi-turn conversations. In practical settings, users often refine their queries, request clarification, or seek simplified explanations, allowing AI systems to adapt responses dynamically. Previous studies have shown that multi-turn interactions can enhance contextual coherence, improve relevance, and potentially increase both perceived quality and usability of AI-generated information. 26 So, by testing only single questions, the study may not fully capture how the AI would perform in a real patient conversation.4,18
A further limitation is the rapid pace of AI model development, which can significantly impact reproducibility. The evaluated models correspond to the versions available at the time of data collection. Large language models are not static products but are continuously updated, finetuned, and retrained on new data. As a result, the same question posed to a model at different points in time may yield different responses, complicating efforts to replicate findings across studies. This highlights the need for longitudinal studies that systematically evaluate AI models over extended periods. Such research would enable the tracking of how iterative updates and model refinements influence critical performance domains, providing a clearer understanding of the stability and reproducibility of large language model outputs in clinical contexts. 26
It is important to distinguish between expert-based clinical accuracy and patient-facing usability. While many responses contained correct orthodontic information, the readability analysis suggests that patients may still struggle to interpret the content without clinician guidance. Although the primary aim of this study was to evaluate the accuracy of the information provided by AI models, it would have been valuable also to assess patient comprehension, engagement, and satisfaction, as these factors are central to determining the real-world effectiveness and clinical usefulness of AI-mediated education. The absence of stratified analysis across question categories, due to the small and uneven distribution of items, is an additional limitation. Future research using larger and more balanced datasets should enable more robust subgroup comparisons and clearer evaluation across specific clinical contexts.
Another limitation is that this study assessed the accuracy, reliability, and readability of AI-generated orthodontic information retrieved exclusively in English. Large language models are trained on multilingual corpora; however, their performance and output quality can vary significantly across different languages. Prior research has shown that the performance of ChatGPT and similar AI systems differs depending on the language used. 27 Similar concerns have been raised in dental research, where AI-generated responses may not maintain consistent quality across different linguistic settings. For example, a recent study evaluated and compared the performance of advanced large language models in answering dental multiple-choice questions in both Arabic and English and found that the models achieved higher accuracy in English than in Arabic. Additionally, the models exhibited translation inconsistencies and reduced interpretative performance in non-English contexts, underscoring the importance of multilingual evaluation. 28 Consequently, the findings of the present study cannot be generalized to other linguistic contexts, and future research should investigate AI model performance in multiple languages to ensure equitable patient education and global applicability.
Conclusion
Large language models demonstrate moderate competence in delivering orthodontic retention-related information under controlled conditions. ChatGPT-3.5 demonstrated the highest accuracy and may therefore be more appropriate for generating clinically accurate information. In contrast, ChatGPT-4 and Copilot showed higher reliability, indicating better structured and balanced outputs. Copilot, which achieved relatively higher readability scores, may be more suitable for patient-facing educational content. However, none of the models yet matches expert-level clarity, reliability, or readability. While promising as adjunctive education tools, they still require human oversight and methodological improvements before broader clinical adoption. Future research should explore the performance of AI models in multi-turn conversational settings, assess their effectiveness across different languages, and evaluate patient comprehension and usability to better determine their role in clinical communication and patient education.
Supplemental material
Supplemental material - Accuracy and readability of artificial intelligence models in providing orthodontic retention related information: A cross-sectional study
Supplemental material for Accuracy and readability of artificial intelligence models in providing orthodontic retention related information: A cross-sectional study by Afnan Ben Gassem and Nebras Al-Thaqafi in Digital Health.
Footnotes
Ethical considerations
No ethical approval was requested since this study was not conducted on human or animal subjects.
Author contributions
Afnan Ben Gassem conceptualised and designed the study, collected and curated the data, conducted the formal analysis and wrote the original draft. Nebras Al-Thaqafi conceptualised and designed the study, collected the data and reviewed, re-wrote and edited the final draft of the study.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This scientific paper is derived from a research grant funded by Taibah University, Madinah, Kingdom of Saudi Arabia – with grant number (1039-13-447) (447-13-1039).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The datasets generated and analysed during the current study are available in the supplementary materials. These include the original AI-generated responses for each model and the detailed scoring tables used for accuracy, modified DISCERN, GQS, and readability assessments.
Supplemental material
Supplemental material for this article is available online.
Appendix
References
