Abstract
Introduction
Healthcare providers have anticipated patients’ use of chatbots to seek medical information with both optimism and pessimism.1,2 Such usage is a continuation of internet search engine use.3,4 Nevertheless, chatbots may fail to provide accurate information, and the information that is provided may not be written at an easily comprehensible level.5,6 Numerous studies across various medical fields have examined chatbot responses to patient questions. These data have yet to be synthesized. There are at least three reasons why synthesis at this early stage of chatbot use by patients could be helpful. First, healthcare providers need to be able to assess the accuracy and readability of chatbot responses. This ability is vital because providers may never have the opportunity to corroborate the chatbots’ answers with the patients. Patients may also benefit from learning the chatbots’ strengths and weaknesses. Second, synthesis may identify factors that influence response quality and readability, thereby enabling improvement in both. Third, synthesis may reveal failings in how chatbots are currently used and studied in this context.
We sought to address these issues through a systematic analysis of the current literature evaluating chatbot responses to both real and hypothetical patient questions. Our cross-sectional meta-synthesis includes findings from research published between 2022 and July 2025.
Methods
Assessment tools used
The DISCERN instrument was originally designed to “assess the quality of written information on treatment choices for a health problem”.7,8 The instrument uses 16 questions, each graded on a one-to-five scale. A grade of one means the answer did not address a question component, three means the element was partially answered, and five means the component was fully answered. Therefore, the possible range is 16–80. Response quality is classified as: very poor (16–27); poor (28–38); fair (39–50); good (51–62); and excellent (≥63).
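For illustration, this banding can be expressed as a simple lookup. The sketch below is ours, not part of the DISCERN instrument; the function name is illustrative.

```python
def discern_quality(total_score: float) -> str:
    """Map a total DISCERN score (16 items, each graded 1-5, so 16-80 overall)
    onto the quality categories used in this synthesis."""
    if not 16 <= total_score <= 80:
        raise ValueError("DISCERN totals must lie between 16 and 80")
    if total_score <= 27:
        return "very poor"
    if total_score <= 38:
        return "poor"
    if total_score <= 50:
        return "fair"
    if total_score <= 62:
        return "good"
    return "excellent"  # 63-80

# Example: a response with a total score of 55 falls in the "good" band
print(discern_quality(55))  # -> good
```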
We assessed readability using the Flesch Reading Ease Score (FRES);9 the higher the score, the easier the text is to read. FRES ranges from 0 to 100, with values ascribed to different education levels.10 In the United States (US), the recommendation is that patient material is written at a sixth-grade reading level,11 corresponding to a FRES ≥80.10 FRES values in the 30–50 range are considered college-level, 10–30 post-graduate, and <10 professional reading levels.
We included a second readability index: the Flesch-Kincaid Grade Level (FKGL).12 This measure assigns text to a US school grade based on the ability to read the material. Like FRES, FKGL is calculated based on a formula containing the average number of words in each sentence and the average number of syllables in each word. The two methods use different weightings of these components.
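For reference, both indices are computed from the same two ratios. The sketch below uses the standard published weightings for FRES and FKGL; the word, sentence, and syllable counts are assumed to come from an external text parser.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score: higher values indicate easier text (0-100)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade required."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Example: a 120-word response in 8 sentences containing 210 syllables
print(round(flesch_reading_ease(120, 8, 210), 1))   # ~43.6 (college range)
print(round(flesch_kincaid_grade(120, 8, 210), 1))  # ~10.9 (high-school grade)
```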
Study identification and data extraction
The primary aim of our study was to evaluate the clinical quality of chatbot-generated responses using the DISCERN tool. To preserve a connection between our evaluation framework (DISCERN) and the assessed responses, we limited our literature search to PubMed-indexed studies. This approach ensured that the chatbot responses assessed were grounded in clinical and patient-oriented contexts.
We identified potential studies using the PubMed search engine with three queries: (1) “chat* AND DISCERN”, (2) “chat* AND Flesch”, and (3) “chat* DISCERN Kincaid”. The reference lists of all the identified studies were searched. As an additional check to confirm that we found relevant studies, every study included was located in the Web of Science database (Clarivate, London, UK), and papers citing those studies were examined. We constructed a PRISMA flowchart to illustrate the search and selection process, which was conducted in mid-July 2025.
We read the abstracts of all identified publications. If the abstracts indicated that the study did not use the DISCERN score along with either FRES or FKGL, or that the DISCERN score was used to assess brochures or other printed patient information, those studies were excluded. All other studies were downloaded and examined in detail to determine whether or not they met the inclusion-exclusion criteria. Studies that used a modified or brief DISCERN score were excluded, except for studies that used the first 15 of the 16 DISCERN questions. The final question assesses the overall quality of the information provided. In those cases, we averaged the scores for the first 15 questions and then added that value to the total. In studies that provided scores for individual DISCERN questions, we found that this extrapolation approach yielded a score that differed from the reported score by less than one point. Therefore, our averaging approach provides a reasonable approximation.
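A minimal sketch of this approximation, with illustrative variable and function names: the mean of the 15 reported item scores stands in for the omitted overall-quality item.

```python
def extrapolate_discern_total(item_scores: list[float]) -> float:
    """Approximate a full 16-item DISCERN total from the first 15 item scores
    by substituting their mean for the omitted final (overall-quality) item."""
    if len(item_scores) != 15:
        raise ValueError("Expected scores for the first 15 DISCERN items")
    return sum(item_scores) + sum(item_scores) / 15

# Example: fifteen items each scored 3 ("partially addressed")
print(extrapolate_discern_total([3] * 15))  # 45 + 3 = 48.0
```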
We included studies that examined either real or hypothetical patient questions. We excluded studies that did not list the complete text of the questions. We also excluded studies that sought answers to physicians’ questions.
We examined each study’s chatbot query protocol, calculated the average number of words used in the question prompts, and recorded the number of questions asked. We extracted the DISCERN score and, when available, FRES and FKGL from data presented in each manuscript. Of the 10 studies (comprising 20 tests) that provided no FRES data, eight provided the entire responses for all questions posed to the chatbots. Therefore, we determined the average FRES for these responses using an online FRES calculator (Flesch Kincaid Calculator - Flesch Reading Ease Calculator). Both authors independently extracted the data, and any discrepancies were resolved by discussion.
Analysis
To determine whether there was an association between response quality and readability, we plotted a graph of FRES and DISCERN scores and performed regression analysis. The included studies used different chatbots known to exhibit different performance, which introduces heterogeneity. Therefore, to minimize this heterogeneity, we also plotted the same relationship using only results from versions of ChatGPT.
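A sketch of this crude regression, assuming the per-test DISCERN and FRES values have been extracted into paired arrays; the values shown are illustrative placeholders, not study data.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical per-test values (placeholders for the extracted study data)
discern = np.array([44, 51, 55, 58, 62, 65, 48, 53])
fres = np.array([38, 35, 30, 28, 24, 20, 33, 29])

fit = stats.linregress(discern, fres)
print(f"slope={fit.slope:.2f}, R2={fit.rvalue**2:.2f}, p={fit.pvalue:.3f}")

plt.scatter(discern, fres)
plt.plot(discern, fit.intercept + fit.slope * discern, color="black")
plt.xlabel("DISCERN score (quality)")
plt.ylabel("FRES (readability)")
plt.show()
```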
We examined five factors that could be associated with the DISCERN score and, therefore, response quality. (1) Number of prompt words: The hypothesis was that more words would provide more detail and context, which could enhance response quality. We calculated the average and median number of words in all questions in each study. The DISCERN score was then plotted as a function of the number of words. (2) Number of questions: A greater number of questions could enhance response quality by providing context and thereby a foundation to answer subsequent questions. However, such enhancement would depend on whether the investigators cleared the chat history between questions. The DISCERN score was plotted as a function of the number of questions in each study. In a sensitivity analysis, we examined the relationship in tests that reported a query protocol consistent with deleting the question history and in tests that did not report query protocols. (3) Number of evaluators: The hypothesis was that scores averaged from multiple evaluators could be expected to be more accurate. Extensive psychology literature supports the superiority and effectiveness of group decision-making (even when there is no communication within the group), attributed to “crowd wisdom” and aggregation effects.13
We recorded the number of evaluators who determined the DISCERN score in each study. When studies posted scores for different evaluator groups, we recorded the average score posted by the most knowledgeable group; for example, one study assessed scores from patients, senior dentistry students, and orthodontists.14 In this example, we only recorded the scores from the orthodontists. The DISCERN score was plotted as a function of the number of evaluators. (4) Journal ranking: We hypothesized that studies reporting positive results (high DISCERN scores) could be favored by positive outcome bias and be published in higher-ranking journals (for example, see Easterbrook et al. and Emerson et al.).15,16
There are several journal ranking metrics. We elected to use the Journal Citation Indicator (JCI: Clarivate, London, UK). JCI measures the citation impact, averaged over 3 years, for various research fields. Thus, a journal with a JCI of 1.25 has a 25% greater citation impact than the average in that category. Because the parameter is normalized, it enables comparison across fields, which, given the range of disciplines covered in the included studies, offers an advantage over parameters such as Journal Impact Factor and SCImago Journal Rank. We plotted the DISCERN score as a function of each journal’s JCI.
(5) Chatbot version/year studied: Examination of these potential influences on response quality is challenging because of differences in chatbot capabilities and because of chatbot updates. Therefore, we created four groups based on the categorization of the chatbots’ ability to provide quality responses and a temporal component reflecting when the study was conducted. For chatbot categorization, we divided the chatbots used into those with higher- versus lower-quality response capabilities. We based these decisions on the performance of chatbots answering medical questions, such as those in the United States Medical Licensing Examination (USMLE),17–19 and an assessment of the grey literature. The higher-quality group included all ChatGPT-4 versions, as well as Perplexity, Gemini Advanced and Gemini 1.5 versions, and all Claude versions. The lower-quality group included all ChatGPT-3 versions, as well as Copilot, Grok, Chatsonic, and Gemini 1.0. The temporal component was divided into two periods: 2023 and 2024. One study was conducted in 2025 and was grouped with the 2024 studies. Many studies provided the date when the questions were asked; for others, the year was inferred from the submission date. When the date was unspecified and could not be determined from submission dates, we excluded the study from this analysis.
Multiple linear regression model
The parameters identified as potentially influencing the DISCERN score were incorporated into a multiple linear regression model. We aimed to construct an explanatory rather than a predictive model for the DISCERN scores. There was missing data in the final model. Therefore, we determined if the data were missing completely at random (MCAR) using Little’s test. We calculated variance inflation factors (VIF) in the final model to determine if collinearity was present. In addition, we assessed heteroskedasticity, skewness, and kurtosis using Cameron and Trivedi’s decomposition test.
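A minimal sketch of this modelling step in Python; the analyses themselves were run in Stata, so this only mirrors the approach, and the file and column names are illustrative assumptions.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical extraction file: one row per chatbot test with the candidate predictors
df = pd.read_csv("chatbot_tests.csv")
predictors = ["prompt_words_recip", "questions_recip", "jci", "version_year_group"]

# Explanatory model for the DISCERN score (complete-case rows only)
X = sm.add_constant(df[predictors]).dropna()
y = df.loc[X.index, "discern"]
model = sm.OLS(y, X).fit()
print(model.summary())

# Variance inflation factors to check for collinearity among the predictors
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, i), 2))
```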
Prompt engineering
Appropriate use of prompt wording when asking questions of chatbots is an emerging field.20,21 The apparent high level of education required to read the responses in most of the included studies led us to examine how prompts could be modified to enhance readability. In a pilot study, we selected six questions from four of the included studies.22–25 This convenience sample of questions was chosen because the original responses left ample room for improvement in readability: all six fell within either the college-level or the post-graduate/professional ranges. We conducted three tests for each selected question. (1) Submitting the prompt from the original study to ChatGPT-4o with the added phrase, “write the response all in text with no tables or bullet points”. Tables and bullet points cause problems for FRES calculators because the calculations are based on sentence length.26 (2) Using the original prompt with the following addition, “Write your response all in text with no tables or bullet points. Write your response at a sixth-grade reading level.” (3) Using the following prompt, “You are a physician responding to a patient’s question. [original prompt question inserted here]. Write your response all in text with no tables or bullet points. You are unsure of the patient’s knowledge of this topic, so write your response at a sixth-grade reading level.” The responses were assessed using an online FRES/FKGL calculator (as mentioned above). We cleared the chatbot history between each prompt and varied the order in which the prompts were used.
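The three prompt conditions can be expressed as simple string templates. This is a sketch that normalizes the no-formatting phrase across conditions; the function name is ours, and the example question is taken from the discussion below.

```python
NO_FORMAT = "Write your response all in text with no tables or bullet points."
SIXTH_GRADE = "Write your response at a sixth-grade reading level."

def build_prompts(question: str) -> dict[str, str]:
    """Return the three prompt variants tested for each selected question."""
    return {
        "original_plus_format": f"{question} {NO_FORMAT}",
        "plus_reading_level": f"{question} {NO_FORMAT} {SIXTH_GRADE}",
        "role_and_context": (
            "You are a physician responding to a patient's question. "
            f"{question} {NO_FORMAT} You are unsure of the patient's "
            "knowledge of this topic, so write your response at a "
            "sixth-grade reading level."
        ),
    }

for name, prompt in build_prompts("What are kidney stones?").items():
    print(name, "->", prompt)
```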
Certainty of evidence
We used the GRADE tool to assess the certainty of evidence in the included studies.27
Statistical analysis
Values are presented as means with 95% confidence intervals (CI). All analyses were conducted using Stata (version 18.0; StataCorp, College Station, TX).
Results
We identified 42 studies that conducted a total of 86 tests (Figure 1).22–25,28–65 The most common reasons for exclusion were: (1) the texts evaluated were patient educational material (for example, brochures), (2) the lack of DISCERN scores because studies used other quality assessment methods, and (3) the lack of both a DISCERN score and FRES or FKGL.
Figure 1. Flowchart showing the study selection process.
The studies primarily originated from the US (40%) and Türkiye (36%). The remainder were from Europe (17%), Australia (two studies), and South Korea (one study).
The study characteristics are provided in the supplementary material. Most studies (25 of 42) tested a single chatbot: ChatGPT-3/3.5 was used in 76% of these. Four studies tested two chatbots, six studies tested three, four studies tested four, and three studies tested six. The chatbots tested were mainly ChatGPT variants (version 3 – 35%; version 4 – 27%). Gemini and Copilot each contributed 16% of the tests. Claude was used in three tests, while Chatsonic, Perplexity, and Grok were each used once.
Query protocols
Thirty-one studies (74%) provided no details of their query protocols. Eleven studies (26%) provided protocol details suggesting that they cleared the question history before asking each new question. However, the deletion of query history was not always explicitly stated.
Quality and readability
Forty-nine (57%) of the 86 DISCERN tests were ranked in the “good” quality range or higher. In contrast, only three sets of test responses scored below a college reading level on the FRES (FRES > 50); two were from Gemini and one from ChatGPT-3.5. An additional seven tests, which reported only FKGL, were below a college reading level (FKGL ≤ 12); two were from Gemini, two from Gemini 1.5, and one each from Copilot, Perplexity, and ChatGPT-4o. Thus, a total of 10 test responses (12%) were below college reading levels.
We found an association between the DISCERN score and FRES: as response quality increased, readability decreased (adjusted R2 = 0.06; p = 0.02; n = 74; Figure 2). There was no evidence of heteroskedasticity (p = 0.48). When we restricted the analysis to ChatGPT-3 and ChatGPT-3.5, the adjusted R-squared value increased (0.14; p = 0.028; n = 29; Figure 3). In contrast, we found no association for ChatGPT-4 (adjusted R2 = 0.03; p = 0.23; n = 20; Figure 3). For ChatGPT-4 tests, both the DISCERN score and FRES increased relative to the ChatGPT-3/3.5 values, but the FRES increase was smaller than the DISCERN increase. These changes shifted the data up and to the right, thereby flattening the regression (Figure 3).
Figure 2. Readability (Flesch Reading Ease Score, FRES) plotted as a function of response quality (DISCERN score) for the 74 tests that measured both parameters. Response readability decreased as quality increased. The red vertical dashed lines indicate the categorical divisions of the DISCERN score, and the red horizontal dashed lines indicate the divisions of the FRES. The black dotted lines represent the 95% confidence intervals of the regression. aR2 is the adjusted R-squared value.
Figure 3. Readability (FRES) plotted as a function of the DISCERN score for different versions of ChatGPT. For ChatGPT-3 versions (left panel), we found an association between FRES and DISCERN. This association was absent for ChatGPT-4 versions (right panel).

There was no relationship between the DISCERN score and FKGL (p = 0.80; n = 75). This lack of association appeared to be due to the FKGL data being range-restricted: the range between grades 10 and 14 contained 83% of the data.
Number of prompt words
We found an association between the number of prompt words and the DISCERN score (adjusted R2 = 0.04; p = 0.03; n = 86; Figure 4); the DISCERN score increased as the number of prompt words increased. Because this result was potentially influenced by the clustering of studies at approximately 10 and 40 words, we applied a reciprocal transform (1/X) to mitigate this effect. Again, there was an association (adjusted R2 = 0.05; p = 0.017; Figure 4 inset), with no evidence of heteroskedasticity (p = 0.22).
Figure 4. The DISCERN score plotted as a function of the average number of words in the prompt for each test. The DISCERN score increased as the number of prompt words increased. The horizontal red dashed line represents the threshold for a “good” quality response. The inset graph shows the transformed data (1/X) and the de-clustering of the points.
Number of questions
Again, we found data clustering, so a reciprocal transform was applied. There was an association between the DISCERN score and the reciprocal of the number of questions: that is, the DISCERN score increased as the number of questions increased (adjusted R2 = 0.07; p = 0.009; n = 86), with no evidence of heteroskedasticity (p = 0.61).
In the sensitivity analysis, we found that tests with no query protocol exhibited a stronger association between the DISCERN score and the reciprocal of the number of questions (adjusted R2 = 0.39; p < 0.001; n = 44). Although there was also an association for tests that likely deleted query history (p = 0.01; n = 42), the sign of the coefficient was opposite; i.e., the DISCERN score decreased as the number of questions increased.
Number of evaluators
There was considerable variation in the number of evaluators: one study used a single evaluator, 18 used two, 12 used three, one used four, eight used more than four, and two studies did not specify the number (one of these used a “panel”, which we assumed to mean ≥3). We converted the number of evaluators into a binary parameter, dividing studies into those with fewer than three evaluators (38 tests) and those with three or more evaluators (47 tests). We excluded the one test in which the number of evaluators was unspecified.
The data were again clustered. After applying a reciprocal transform, there was no association between DISCERN score and the number of evaluators (adjusted R2 = 0.02; p = 0.09; n = 82). As a binary variable, tests with more than two evaluators had higher DISCERN scores (55.1: 95% CI [52.6 to 57.5]) than tests with one or two evaluators (50.8: 95% CI [47.4 to 54.2]; p = 0.04). The increase represents a shift from “fair” to “good” for the quality category.
Journal ranking
We found a negative association between the DISCERN score and JCI: the DISCERN score increased as the JCI decreased (adjusted R2 = 0.04; p = 0.045; n = 72; Figure 5), with no heteroskedasticity (p = 0.33).
Figure 5. DISCERN score plotted as a function of the Journal Citation Indicator (JCI). The DISCERN score increased as JCI decreased. The vertical red dashed line indicates the average citation impact in a journal’s subject category. The horizontal red dashed line represents the threshold for a “good” quality response.
Fourteen tests were published in journals that did not have a JCI. The DISCERN score in these did not differ from that for tests published in journals with a JCI (52.3: 95% CI [46.2–58.3] vs 52.9: 95% CI [50.7–55.1]; p = 0.84).
Chatbot version/year
Table 1. Chatbot temporal and version changes in DISCERN score.
We were unable to determine the year when the tests were performed in 10 of the studies. The DISCERN score for these tests (54.7: 95% CI [47.9–61.4]) did not differ from that for tests in which the year was identified (52.6: 95% CI [50.4–54.7]; p = 0.51).
Multiple linear regression model
Table 2. Multiple linear regression model (n = 65).
The mean VIF was 1.08.
Prompt engineering
ChatGPT-4o responses to the six chosen questions produced marginally higher FRES than those in the original studies (28: 95% CI [18–37] vs 23: 95% CI [7–38]), but the average remained in the post-graduate range (Figure 6). In contrast, readability increased dramatically when the prompt included an instruction to write responses at a sixth-grade reading level (FRES 75: 95% CI [67–83]). Two of the 12 responses achieved a sixth-grade level; the majority achieved a seventh-grade reading level. The prompt containing additional context produced a similar increase in the average FRES (72: 95% CI [69–75]).
Certainty of evidence
All of the studies were observational and so started at a GRADE level of “low”.66 The GRADE categories of “inconsistency” and “indirectness” did not apply.
We classified the risk of bias category for the outcomes of DISCERN score and readability as serious for all studies. None of the studies indicated that the evaluators received training and practice in using the DISCERN tool. The knowledge and experience of the evaluators were seldom reported. The knowledge and experience range spanned patients (we excluded data from patient evaluations), medical students, residents, fellows, and board-certified physicians. The failure to report the query protocol suggests that some studies did not clear the chat history after each question. Consequently, those studies may have provided context for the chatbots, which could lead to better-quality responses to subsequent questions. Very few studies reported the exact method for evaluating readability; i.e., which calculator was used. Therefore, we concluded that both outcomes were potentially biased.
The imprecision GRADE category was also a source of concern. When studies reported individual evaluator DISCERN scores for the same question, there was sometimes a considerable range (as much as 60 points). Also, different readability calculators handle the presence of bullet points, numbered points or sections, and references in various ways. Consequently, different calculators assign FRES scores to the same piece of text that can differ by as much as 10 points. Thus, both outcomes were associated with imprecision.
Therefore, we downgraded all studies to the “very low” category of certainty.
Discussion
We found that the readability of chatbot responses to patient questions decreased as the quality of the responses increased. Furthermore, response quality was associated with the number of questions asked and a parameter that reflected the chatbot version and the study year. We demonstrated that response readability can be readily affected by the choice of prompt wording; improving readability is a straightforward process. Our synthesis of the early investigations into patients’ potential use of chatbots to seek medical information provides insight into the advantages and disadvantages of this approach.
Readability
The high level of reading ability required to comprehend the responses is expected given the inherent complexity of medicine, especially the frequent occurrence of multisyllabic words. Nonetheless, several studies report increased readability by adjusting the prompt, a technique known as prompt engineering. One study added the phrase “in simple terms” to the prompt, which increased FRES from college to high school level (39–55).28
However, there was also a slight decrease in the DISCERN score: from 36 to 32 (both in the “poor” quality range). Similarly, investigators in another study (excluded from our analysis because they did not use DISCERN) included the instruction “to produce easily readable material at a sixth-grade reading level” in the prompt. The responses had an average FRES of 58, corresponding to ninth grade.67
This was a higher FRES than achieved by any study included in our analysis (Figure 2). We also used prompt engineering to improve readability with a small sample of questions from the included studies. Readability was enhanced to eighth-grade or lower reading levels by including the phrase, “write the response at a sixth-grade reading level” (Figure 6).
Figure 6. Readability (FRES) can be influenced by the prompt (see text for details). Six questions from four of the included studies were run using the same prompt in ChatGPT-4o and then with additional wording requesting that the response be written at a sixth-grade reading level. This approach produced a substantial increase in response readability.
The concept that how questions are asked is a critical determinant of chatbot responses has only recently started to spread from computer science to medicine and the public. To date, this progress is limited. Searches (in Google Trends) for the term “prompt engineering” have not increased in the past 16 months, despite increases in searches for “chatbot” and “ChatGPT” (data not shown).
Similarly, a PubMed search for the term “prompt engineering” indicated that the phrase first appeared there in 2022 (two articles), increasing to 27 articles in 2023, 213 in 2024, and 254 so far (August 7th) in 2025. Adding the search term “Chat*” to the query reduced the counts to 18 articles in 2023, 102 in 2024, and 92 so far in 2025. Therefore, it is unsurprising that few prompts were explicitly designed to enhance accuracy or readability in studies conducted between 2022 and 2024. Prompt engineering is a crucial skill for physicians and patients.68 More effective prompts should increase both the accuracy and readability of chatbot responses. Primers on writing effective prompts are now published.69
There will be a trade-off between response quality and readability; however, prompt engineering will enable both to be tailored to desired levels.
Response quality
Quality is a subjective measure that depends on how the assessment is done and the assessors’ knowledge. Despite its systematic approach, validation, and demonstrated reliability, the DISCERN instrument remains subjective. Many of the included studies reported DISCERN scores with considerable variation among evaluators. For example, evaluator scores had a 60-point range, from “very poor” to “excellent”, for the same responses in one study.14 Such wide ranges contribute to the scatter and low R-squared values, as well as the residual heterogeneity of our model.
Training using the DISCERN instrument is recommended.8,70 However, none of the included studies reported whether their evaluators received such training, so the number that did so is unknown. A lack of training may lead physicians to unconsciously rely on their professional expectations and, therefore, assign greater value to technical detail over clarity. Such bias could skew DISCERN scores downward for highly readable resources. This issue likely contributes to score heterogeneity.
We should also consider whether the DISCERN instrument is appropriate for evaluating chatbot responses. This tool has been validated,8 and its structured format provides consistency. On the other hand, DISCERN was designed to evaluate written material rather than the dynamic conversational interaction patients would have with chatbots. Modifying DISCERN to incorporate chatbot-specific elements or combining it with tools that assess different aspects of chatbot-patient interaction might enhance evaluation.
Nevertheless, DISCERN is currently the most widely used tool for evaluating response quality. Other evaluation tools, such as the modified and brief versions of DISCERN, DISCERN-AI, EQIP, global quality score (GQS) and Likert scales, also have deficiencies. Some are yet to be validated, some are more subjective and lack consistent application between studies, and some lack the granularity required to provide thorough evaluations.
Factors associated with DISCERN score
Prompt words
Our original premise was that response quality would be associated with the number of prompt words. However, the analysis did not support this idea. Although we found an association between the DISCERN score and prompt words in the initial regression, a small number of questions, each containing around 40 words, appeared to be responsible. We based our hypothesis on the concept that more words equal more specific questions and increased context. Hence, such questions yield more accurate and definitive answers.68 That concept may, in general, be correct and is supported by five of the six ∼40-word prompts achieving “good-to-excellent” DISCERN scores. Nevertheless, even short questions can be specific enough to yield accurate and definitive answers. For example, the four-word question, “What are kidney stones?” requires no context and is explicit. Therefore, even such short questions could be expected to yield a response with “good” to “excellent” quality.
Number of questions
The number of questions remained associated with the DISCERN score in the multiple linear regression model (Table 2). This parameter would not be expected to affect the DISCERN score directly if the query history is deleted before each question, thereby eliminating potential contextual gains. Conversely, if the history is not deleted, the additional context could boost DISCERN scores. The query protocol was unspecified in the majority of studies, which suggests no history deletion. The sensitivity analysis results were consistent with the idea that the lack of a query protocol (and hence likely no history deletion) and asking a larger number of questions provided context, which increased the average DISCERN score.
The number of questions could also indirectly affect the DISCERN score because a larger number increases the likelihood of including questions that play to the chatbot’s strengths, thereby increasing the average score. Furthermore, more questions would reflect the chatbot’s actual performance level because outlier scores would have less influence on the overall average.
Chatbot version/year
Studies have indicated differences in accuracy between chatbots in other types of tests. For example, 200 questions in the style of those used in the USMLE were posed to five chatbots. Claude and ChatGPT-4 scored the highest percentage of correct answers (83% and 82%, respectively). Three other chatbots achieved lower scores: Copilot 60%, ChatGPT-3.5 58%, and Gemini 54%.19
Such comparisons are challenging because chatbot capabilities are evolving. We combined chatbot type and temporal changes to create a categorical parameter. This grouping was positively associated with the DISCERN score. However, our chatbot categorizations may be questioned because there are conflicting assessments of their quality. Therefore, as a sensitivity analysis, we restricted the categorical parameter to the chatbots for which quantitative assessment is available: ChatGPT versions, Claude, Gemini 1.0, and Copilot. In the multiple linear regression model, the adjusted R-squared value increased to 0.24 (from 0.23), and the p-values for reciprocal questions and chatbot version/year remained unchanged. The p-value for JCI increased to 0.16 from 0.086. These results indicate the robustness of the chatbot version/year parameter.
Journal citation indicator
JCI can serve as a proxy for journal impact. Because JCI values are normalized, comparisons can be made across fields. However, the values represent three-year averages, which may not accurately reflect journal impact in rapidly evolving fields such as chatbot research.
We anticipated that studies with higher DISCERN scores would be published in journals with a higher JCI. Therefore, we were surprised to find the opposite in the crude regression analysis. The observed relationship could be consistent with higher-ranking journals being biased against chatbot use based on current perceptions of poor response accuracy, errors, and their propensity to hallucinate.2,71 Still, there are many factors involved in journals’ publication decisions. Moreover, the coefficient was relatively small, contributing only 9.6 DISCERN units over the entire range, which is less than the width of a single DISCERN category. In the adjusted model, there was only weak evidence of an association. Therefore, JCI represents, at most, a minor factor.
Number of evaluators
We found that the average DISCERN score in studies with more than two evaluators was higher and in the “good” range, compared to studies with one or two evaluators (in the “fair” range). More evaluators will reduce the influence of outlier scores. The evaluators’ knowledge and training in using DISCERN will also contribute to the score. Training was never mentioned in any of the studies, while the evaluators’ knowledge was sometimes mixed. For example, evaluators in one study included residents and faculty.56 Therefore, the lack of association between DISCERN score and number of evaluators in the crude regression was not surprising. These issues, combined with the lack of a clear expectation regarding whether, or how, more evaluators would influence the DISCERN score, prompted us to exclude this parameter from the regression model.
Limitations
There are several limitations to our analysis. First, as evident in all the graphs, there is considerable heterogeneity. In addition to the frequent lack of experimental detail provided in the studies, other variables, such as subject matter and the specific questions asked, could not be adjusted for. Nonetheless, even under these circumstances, the observed association between response readability and quality, the ability to enhance readability through prompt engineering, and the interpretation of the regression model provide insight.
Second, chatbot research is rapidly evolving; additional studies will have been published since the conclusion of our literature search. Furthermore, ongoing updates may result in improved response quality over time. We only searched PubMed, and, therefore, it is possible that studies published in engineering fields that included physician authors were missed.
Third, almost all the studies asked each question only once. Therefore, the repeatability and reproducibility of the results are unknown. Additionally, this piecemeal approach differs from how patients typically interact with chatbots. Patients could seek clarification for some responses; none of the included studies did this. Thus, the questions posed do not reflect how patients’ conversations with chatbots might progress. Clarification and context will likely enhance the quality of responses.
The included studies were predominantly from the US and Türkiye and were all conducted in English, which may result in bias. Although the influences of location and language have yet to be systematically evaluated, one study reported that responses to the same healthcare question appeared to depend on the user’s location.72 The authors assessed responses from ChatGPT-3.5, Google Bard and Bing in the US, Indonesia, Nigeria, and Taiwan in November 2023. Further work is required to determine the extent and precise nature of potential location bias.
The final model was missing 21 chatbot tests; almost a quarter of the total. Therefore, missing data might introduce bias. Little’s test provided no evidence against MCAR; however, it does not prove MCAR. The sensitivity analyses we performed supported MCAR. Consequently, we do not believe that missing data had an adverse effect on the results or our interpretations.
Pessimism or optimism
There has been optimism that chatbots will benefit patients, physicians, and their interactions. Although generally optimistic, many articles exploring the possibilities have outlined the challenges alongside the potential advantages of chatbots and large language models.73–75 On the other hand, comments in chatbot studies are more pessimistic. For example, “ChatGPT-generated responses… were outdated and failed to provide an adequate foundation for patients’ understanding…”24 and “… the information presented was difficult to read, with varying quality, understandability, accuracy, and comprehensiveness…”.76 Even studies that reported high DISCERN scores were circumspect in their appraisal and focused on readability and reference sources: “There was generally high quality in the answers given… but there was a high reading level required… However, it is unclear where the answers originated, with no source material cited”.47 Therefore, the current overall opinion appears to be one of skepticism.
Some chatbots that provide higher-quality responses currently require paid subscriptions. The free-to-use chatbots generally yielded lower-quality responses. This difference could result in access disparity.
Future work
The assessment of chatbot responses to patients’ questions is a new area of research; the first studies included in our analysis were published in 2023. Consequently, there are no established standards for conducting such studies. The GRADE assessment of “very low” and the considerable heterogeneity in the data indicate that some standards should be established. We recommend that evaluators be trained in using the DISCERN tool and practice its use before conducting a study. Having more than two evaluators appears warranted to mitigate the influence of overly positive or negative scores. Even with the subjectivity of DISCERN, it is a validated instrument that provides a granular assessment of responses. This is an advantage over Likert scales, which, although easy to apply, fail to allow inter-study comparisons. The query protocol should be provided, especially a statement of whether the chat history was deleted between questions. The current studies primarily pose questions that do not accurately reflect how patients are likely to interact with chatbots. More natural question sequences will likely improve response quality because of the added context. Greater experimental detail would help synthesis; for example, the date the study was conducted, the specific chatbot used, and the calculator used to assess readability.
Conclusion
The current quality and readability of chatbot responses are inadequate for patient use. Many responses did not reach “good” quality, few were written below a college reading level, and none achieved the recommended sixth-grade level. These shortcomings could taint the perceptions of healthcare professionals and patients, particularly if widely disseminated. However, the potential utility of chatbots should not be dismissed, especially at this early stage in their development. Improvements in question structure, achieved through prompt engineering, will enable readability and quality to be tailored to the knowledge levels of specific patient populations.
Footnotes
Author contributions
Both authors contributed to all aspects of the study.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data that support the findings of this study are available from the authors upon reasonable request.
Registration
Neither the review nor the protocol was registered.
Supplemental Material
Supplemental material for this article is available online.
References