Abstract
Access to high-quality health information (HI) is critical for everyone involved in the research and management of medical conditions such as spinal cord injury (SCI). Recently, the use of Large Language Models (LLMs) through AI-based chatbots like ChatGPT has become increasingly integral to how people seek and consume HI. While LLMs have been evaluated in various clinical and health domains, there remains a notable gap in the literature regarding their use for SCI-specific questions. We conducted a narrative synthesis to identify the opportunities, challenges, and risks of using LLMs in SCI-related HI tasks, and to provide future direction for researchers, clinicians, and policymakers to better understand this fast-evolving landscape. We searched PubMed, Embase, and Google Scholar up to December 2025 and identified nine primary articles that investigated LLMs in the context of SCI-related queries. Our synthesis of the literature revealed that, although there are promising results, these should be interpreted with caution because the evidence for LLMs’ capability to effectively answer SCI-related questions is mixed. In addition, the LLM outputs were challenging to read, typically requiring a college-level education (grades 14–15) to be adequately understood. We recognize that LLMs can serve as valuable tools for accessing HI in SCI. However, LLMs can also pose significant risks, including the spread of misinformation or disinformation that may be inaccurate or even dangerous, which can mislead individuals and caregivers, potentially resulting in detrimental health outcomes. Finally, methodological rigour needs to be improved to produce higher levels of evidence.
Introduction
The landscape of health information (HI) is undergoing rapid change with the growing use and accessibility of artificial intelligence (AI). Large Language Models (LLMs) such as OpenAI’s ChatGPT or Google Gemini are redefining how caregivers, clinicians, researchers, and other stakeholders interact with the health care system and engage with medical information. LLMs are a form of AI system designed to process, learn from, and generate human language by analyzing massive amounts of text data.1,2 These models can be used to create an interactive system (e.g., a chatbot), allowing individuals to input questions or information for the LLM to analyze and respond to accordingly.3,4 This capability makes LLMs an incredibly powerful tool for consuming health information.
While LLMs hold promise as tools for HI tasks, these opportunities must be weighed against potential risks. It is becoming increasingly important to examine their accuracy, reliability, and potential impact on patient health, understanding, and decision-making. LLMs, such as those from the GPT family that power ChatGPT, are generalist models trained on a wide range of disparate topics and information sources, with varying degrees of veracity. 5 The non-specific sourcing of information can lead to misrepresentations of evidence, with no clear weighting of the levels of evidence that support an LLM’s outputs.5,6 This is of particular concern in HI, where it can lead to false or misleading information and aggravate infodemics (the rapid spread of health misinformation). 7 For example, during the COVID-19 pandemic, the mismanagement of public health measures and vaccinations significantly impacted patient outcomes and disease transmission. 7 Research has also shown that cancer patients who followed misleading claims promoting unproven, miracle cures had a poorer prognosis compared to patients who followed conventional cancer care.8,9 Therefore, accurate HI is critical to supporting the needs of patients, caregivers, clinicians, researchers, and policymakers in efforts to improve health outcomes, functionality, and quality of life.
Research evaluating the potential use of LLMs in health and medical information tasks is rapidly proliferating in all fields, including spinal cord injury (SCI). The use of LLMs for HI tasks in SCI is of great importance due to the heterogeneity and complexity of managing and treating these patients. Multiple clinical guidelines exist, often with differing recommendations. In addition, the advent of LLMs can be particularly significant for individuals with SCI and their caregivers, as they often manage complex, lifelong conditions that require frequent decision-making. 10 Reliable, relevant, and easy-to-understand HI is vital for individuals with SCI and their caregivers to promote their health, prevent and manage SCI-related complications, maintain independence, and improve quality of life. 11 During various stages of treatment, recovery, and rehabilitation, individuals living with SCI have a strong desire to enhance their functional capacity, gain independence, and embody a sense of control over their bodies. 12 Similarly, SCI caregivers seek information to support the often-complex health care, legal, and home care resources for their loved ones. 12 Numerous studies have emphasized the gaps in health literacy, underscoring the need for enhanced education on person/family-centred SCI care.13,14 LLMs hold significant opportunities in addressing SCI-related knowledge gaps by simplifying complex medical information and promoting health literacy. In addition, the proliferation of numerous LLM platforms enables a diverse lived-experience and caregiver population to access reliable information in a user-friendly manner at their discretion. 6 This could be especially valuable for individuals who have often experienced marginalization or insufficient attention in health care settings.
As with any fast-growing technological advancement, adoption and research might initially suffer from a lack of methodological rigour and from heterogeneous investigations. 15 Therefore, we sought to summarize the current literature on the use of LLMs to answer questions about SCI and to identify the risks, opportunities, and challenges for both research and real-world use of LLMs for consuming HI by diverse stakeholders. Given the heterogeneity in the published literature on this matter, we present a narrative review of articles evaluating different LLMs or LLM-based chatbots for different HI tasks. With this review, we sought to: 1) provide an overview of the current literature, including types of studies, tasks, and systems evaluated, and 2) answer the following three research questions: a) How are LLM-based systems’ questions and prompts designed in SCI? b) How are LLM-based systems’ responses evaluated in SCI? c) What questions about SCI are more challenging for LLM-based systems?
Methods
A comprehensive search was conducted to identify articles up to December 8th, 2025, pertaining to the use of LLMs to answer SCI-related questions. We used PubMed and Embase as primary databases, and reference lists and Google Scholar as additional secondary sources to identify relevant studies. The PubMed search was: (“spinal cord” OR “spinal cord injury” OR “spinal cord injuries”) AND (“large language model” OR LLM OR ChatGPT OR GPT OR “generative AI”) AND (evaluate OR evaluation OR compare OR comparison OR comparative OR benchmarking). The Embase search was: (exp Spinal Cord/ or exp Spinal Cord Injury/ or (“spinal cord” or “spinal cord injur*” or SCI).ti,ab,kw.) and (exp ChatGPT/ or exp Generative Artificial Intelligence/ or (large language model* or LLM or ChatGPT or GPT or “generative AI” or “generative artificial intelligence”).ti,ab,kw) and (evaluat* or compar* or benchmark*).ti,ab,kw.
We included peer-reviewed articles and preprints that addressed the application of LLMs to SCI of any type (e.g., traumatic, non-traumatic, pediatric). We excluded non-English publications, general AI applications (e.g., robotics), and publications not related to any type of SCI. Although we did not perform a systematic review, we report a PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) flow diagram of the study selection process. 16 Included papers were synthesized based on the goals for using the LLM, the research design, and the LLM task categories as defined in the transparent reporting of a multivariable model for individual prognosis or diagnosis (TRIPOD)-LLM. 17 We further annotated the LLM system used, the querying and prompting strategies employed, whether a formal evaluation of different concepts was performed and, if so, how, and the limitations of the overall study design.
Further, we used the TRIPOD-LLM reporting guidelines to assess each publication for the presence or absence of items considered essential for the reproducibility of studies reporting on the evaluation of the performance of LLMs. 17 The 60 TRIPOD-LLM items were independently assessed by three authors (RWD, SS, and ATE), with each included study evaluated by at least two assessors. After the independent assessment, the three assessors resolved conflicts through consensus discussion to maintain consistency across studies. We note that the TRIPOD-LLM reporting guidelines are not quality appraisal or risk of bias tools, but rather a set of guidelines to increase the reporting transparency and reproducibility of articles reporting on LLM research. We use them here as a surrogate measure of transparency and reporting quality rather than of the quality of the research and investigations. To our knowledge, no validated structural instrument exists to assess the risk of bias and study quality in studies evaluating LLM performance. We report the assessment (present, absent, or not applicable [NA]) for each TRIPOD-LLM item and included study as a heatmap. We also report aggregated values per item.
Given that Google Trends was the most common method for identifying themes for questioning and prompting LLMs in the included studies, we performed a similar search to provide context and to note the potential shortcomings of this method. For illustrative purposes, we conducted a Google Trends search as in Temel et al., 18 with search query “spinal cord injury” worldwide since 2004 to present. Results are shown in the Supplementary Data.
All plots and figures were generated in R (Version 4.4.1) 19 using RStudio 20 and the ggplot2, 21 patchwork, 22 and DiagrammeR 23 R packages.
Results
Overview of included studies
Figure 1 shows the study selection diagram. A total of 71 studies were screened, of which 19 were considered for full-text review. Of those, eight were excluded because, although related to SCI, they did not specifically investigate LLMs in the context of SCI (e.g., spine trauma). A total of 3 records related to SCI were excluded because they were not peer-reviewed research articles (e.g., letters to the editor, conference abstracts). A total of nine research article publications that assessed the performance of LLMs in response to SCI-related queries were included for review and synthesis. Table 1 summarizes key characteristics of the included articles. As of this writing, one paper was published in 2023, two in 2024, and six in 2025.

Figure 1. PRISMA flow diagram for the selection process of included articles.
Table 1. Summary of the studies included in the review.
Of the nine included articles, five18,24–27 aimed at evaluating the LLM’s ability to answer general questions related to SCI, which we refer to as the general question and answer task (Q&A). Four of them constructed queries based on internet searches. Only one study assessed the experience of individuals with lived experience interacting with a chatbot. 26 Of the nine, two28,29 studies assessed LLM-based chatbots’ use in providing recommendations to support clinical decision-making, which we refer to as clinical Q&A. One study assessed ChatGPT 3.5 and 4.0 for supporting medical education for non-traumatic SCI by assessing the chatbots’ ability to answer textbook questions and provide readable explanations. 30 One study assessed ChatGPT’s capacity to produce academic work by asking the chatbot to write a review on SCI in the Middle East. 31
In terms of the evaluated LLM systems, all nine studies assessed some form of GPT-based system, with eight of them using different versions of ChatGPT and one querying GPT-4o directly through an API. Only three studies benchmarked different systems (Gemini Advanced, Gemini-1.5 Pro, Claude-3.5 Sonnet, Llama-3.1 70B, DeepSeek-V3),25,27,29 while only one compared two versions of ChatGPT. 30 Of all the evaluation instances, only two (Llama-3.1 70B and DeepSeek-V3) used open-source LLMs.
How are LLM-based systems’ questions and prompts designed in SCI?
Of the five included studies that evaluated a general Q&A task, four generated the topics for their questions using Google Trends,18,24,25,27 which provides aggregated data on the searches that people perform in the Google search engine. The four studies used a similar approach: they identified the most frequently searched keywords related to SCI from Google Trends, then used that information to develop queries to pose to the LLM system. Temel et al. 18 identified the 25 most commonly searched keywords and entered them directly into ChatGPT one by one, in order of prevalence according to Google Trends. Note that these are phrases with four or fewer words each. Based on the report, the authors did not ask ChatGPT a specific question, but rather prompted it with the phrases alone. Özcan et al. 24 provided more details on the use of Google Trends to identify key topics. Using the three most searched keywords for SCI, the authors developed a total of 47 questions under the guidance of an experienced clinical specialist. Their questions were divided into “General Information”, “Complications”, and “Treatment”. They then posed the developed questions to ChatGPT directly, one by one, and with no further information. Li et al. 25 went a step further by combining the 20 most frequently searched SCI questions from Google Trends with questions sourced from SCI-related websites (National Spinal Cord Injury Association, International Spinal Cord Society, Miami Project to Cure Paralysis, and Christopher & Dana Reeve Foundation). Using these data, a panel of three spinal surgeons refined 37 questions, which were then posed to four different LLM systems. The authors provided further details on prompt design and implementation: an introductory prompt, “I have some questions about spinal cord injury”, was used before each question to provide context. Finally, Lau et al. 27 used the five most common SCI-related keywords on Google Trends to develop 48 questions. The questions were then sent via an Application Programming Interface (API) to GPT-4o and DeepSeek-V3 with a temperature of 1 (an LLM parameter that controls the randomness and “creativity” of the responses, with 1 being a balance between predictability and variability). No further information was included in the prompt. Given the variability in responses, they submitted each question three times independently.
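To make the role of the temperature parameter concrete, the following is a minimal sketch of the commonly used temperature-scaled softmax, in which candidate-token scores are divided by the temperature before being converted into sampling probabilities. It is an illustration only: the included studies do not describe how GPT-4o or DeepSeek-V3 implement decoding, and the function names and toy scores below are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores (logits) into probabilities after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_index(logits, temperature, rng):
    """Draw one candidate index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical scores for three candidate next tokens (illustrative values only).
toy_logits = [2.0, 1.0, 0.5]

low_t = softmax_with_temperature(toy_logits, temperature=0.2)   # sharper distribution
unit_t = softmax_with_temperature(toy_logits, temperature=1.0)  # unscaled distribution

# Because sampling is random, repeated draws at temperature 1 can differ,
# which is one reason to submit the same question several times.
rng = random.Random(0)
three_draws = [sample_index(toy_logits, 1.0, rng) for _ in range(3)]
```

Lower temperatures concentrate probability on the highest-scoring candidate, while a temperature of 1 leaves the model's distribution unscaled, so independent runs of the same prompt can produce different responses.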
The two studies evaluating LLM systems for clinical Q&A28,29 used very similar strategies to develop their questions. Saturno et al. 28 used the “Management of acute cervical spine and spinal cord injury” guidelines developed by the US Congress of Neurological Surgeons (CNS) as a ground truth. They grouped the CNS recommendations across 21 topics into “clinical assessment”, “diagnostic”, and “treatment”. From these three groups, an experienced spine surgeon generated 36 questions, which were posed directly to ChatGPT-4.0 only once, without further contextual information or follow-up. Yu et al. 29 used the American College of Surgeons Best Practices Guidelines: Spine Injury as ground truth, which contains recommendations across 21 relevant sections for SCI and 52 key points to guide clinical decision-making. The authors transformed the 52 key points into questions, ensuring similar language to the guidelines, and had a clinical expert verify the questions for clinical relevance. The questions were then posed to ChatGPT-4o and Google Gemini Advanced only once, with no further contextual information. We note that the two studies have overlapping authors, which might explain similarities in their procedures. In both cases, the authors justified prompting without any further contextual information as a way to assess the LLM system’s baseline capabilities. Yu et al. 29 also rationalized the use of single-time prompting, explaining that it reflects “typical clinical use, where clinicians or patients are unlikely to pose the same question to LLMs multiple times”.
García-Rudolph et al. 30 developed a different prompting strategy. Since their goal was to evaluate ChatGPT for medical education, the authors designed a prompt that included a question and multiple-choice options, with each option on a new line. A total of 50 questions were obtained from “Chapter 18: Non-Traumatic Spinal Cord Injury” of “The Essential Spinal Cord Injury Medicine Question Bank”. The questions were then posed to ChatGPT-3.5 and ChatGPT-4.0 one at a time without further information. Importantly, the authors verified that the answers, explanations, or related explanations were not indexed on Google before the release of the models used.
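The prompt layout described above (a question followed by one option per line) can be sketched as follows. This is a hypothetical illustration rather than the authors' actual prompt: the lettering scheme, function names, placeholder question, and toy answer key are all assumptions introduced here.

```python
def build_mcq_prompt(question, options):
    """Format a question followed by lettered options, one option per line.

    The "A.", "B.", ... lettering is an assumption; the source only states
    that each option appeared on a new line.
    """
    lines = [question.strip()]
    for index, option in enumerate(options):
        letter = chr(ord("A") + index)
        lines.append(f"{letter}. {option.strip()}")
    return "\n".join(lines)


def score_answers(predicted, answer_key):
    """Return the fraction of multiple-choice questions answered correctly."""
    if len(predicted) != len(answer_key) or not answer_key:
        raise ValueError("predicted and answer_key must be non-empty and equal in length")
    correct = sum(p == k for p, k in zip(predicted, answer_key))
    return correct / len(answer_key)


# Placeholder content only; not taken from the question bank.
example_prompt = build_mcq_prompt(
    "Placeholder question text?",
    ["First option", "Second option", "Third option", "Fourth option"],
)

# Toy answers: three of four match the toy key.
example_accuracy = score_answers(["A", "C", "B", "D"], ["A", "C", "D", "D"])
```

Scoring against a fixed answer key is what makes this task a comparatively objective measure of accuracy, since no rater judgment is involved.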
Hose et al. 26 investigated how individuals with SCI/D (SCI or disease) interacted with ChatGPT using structured scenarios simulating real-world symptom management challenges. They developed the scenarios from the Urinary Symptom Questionnaires for Neurogenic Bladder (USQNB), a validated instrument, to evaluate the use of chatbots in supporting urinary symptom management. No further information was provided on the prompts and interactions used by the participants, or on the instructions provided in the scenarios. Finally, Aly and Aly 31 used ChatGPT to generate a research review. They described what information they wanted to get from the chatbot, but did not provide information on the prompts or questions used.
How are LLM-based systems’ responses evaluated in SCI?
In general, the evaluation of LLM performance in SCI has centred around the readability, accuracy, and quality of the generated responses. Readability refers to the complexity of the text and how easy it is to comprehend and read. While readability measures are generally comparable and standardized across articles, the definition and metrics used for accuracy and quality vary. We noted evaluations of accuracy as those assessing the correctness of the information provided by the LLM system, either by expert comparison to a reference text, subjective evaluation, or by a downstream task (i.e., correctly answering textbook medical education questions). Quality is a more comprehensive evaluation of how the information is presented to the user, often including aspects such as completeness, understandability, and usefulness.
Readability
In the included articles, readability is assessed using common quantitative metrics that measure lexical and/or syntactic complexity of a text, providing either a readability score or the reading grade level needed to understand the text (Table 1). Li et al. 25 and Temel et al. 18 examined the readability of responses generated to SCI-related inquiries using the Flesch-Kincaid reading ease score to derive grade levels. Both studies reached similar conclusions, with LLM results having low readability, indicating that in order to understand the given LLM output, individuals and/or caregivers would require approximately 14 to 15 years of formal schooling or a college-level education. Li et al. 25 found that readability is generally low across the four tested chatbots, suggesting it is not a characteristic of a particular system. García-Rudolph et al. 30 evaluated the readability of ChatGPT-4 explanations of educational topics in non-traumatic SCI using a battery of readability metrics. They found responses to be significantly more complex than the corresponding textbook explanations. Both Li et al. 25 and García-Rudolph et al. 30 also evaluated readability after prompting the LLM systems to reduce response complexity, finding that readability improved significantly in both question-answering and medical education explanation tasks.
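As an illustration of how such metrics work, the sketch below implements the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas, both of which depend only on average sentence length and average syllables per word. The syllable counter is a rough vowel-group heuristic introduced here as an assumption; the included studies do not report which software they used, and dedicated tools may give somewhat different values.

```python
import re


def count_syllables(word):
    """Rough heuristic (assumption): count vowel groups, dropping a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def text_stats(text):
    """Return (sentences, words, syllables) for a plain-text passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(text):
    """Higher scores indicate easier text."""
    n_sent, n_words, n_syll = text_stats(text)
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (n_syll / n_words)


def flesch_kincaid_grade(text):
    """Approximate US school grade level needed to understand the text."""
    n_sent, n_words, n_syll = text_stats(text)
    return 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59


# Illustrative sentences written for this sketch, not taken from any LLM output.
simple_text = "The cat sat. The dog ran."
complex_text = (
    "Neurogenic bladder dysfunction necessitates individualized "
    "intermittent catheterization protocols."
)
```

Because both formulas reward short sentences and short words, asking a system to simplify its wording is the kind of change these scores would be expected to register.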
Accuracy
Six articles investigated whether the generated LLM responses were accurate with respect to a reference (either a text reference or experience).24,25,27–30 Yu et al. 29 and Saturno et al. 28 investigated the use of LLMs for answering SCI-related inquiries from a clinical Q&A perspective. In both cases, accuracy was determined by concordance with established guidelines, evaluated by clinical experts, using a binary classification (concordant or nonconcordant). Both studies found that response accuracy varies by topic and question, with similar overall results: 61% and ∼69–73% concordant responses in Yu et al. 29 and Saturno et al., 28 respectively. For those responses considered nonconcordant, the majority (71% to 78%) were deemed to contain insufficient information, while only around 20–30% were considered to provide contradictory information. Yu et al. 29 found that questions categorized as informational, diagnostic, or treatment-related had similar quality, with no statistical differences between ChatGPT and Gemini. Conversely, Saturno et al. 28 found that while treatment questions had the highest overall concordance scores (70.8%), this was reduced for diagnostic questions (57.1%), and further for clinical assessment questions (20%). Importantly, Saturno et al. 28 found that 80.8% of the responses by ChatGPT were recommendations supported only by level II/III evidence. They concluded that, “…higher volume of lower-quality evidence may potentially bias the model’s performance to be more concordant with strictly level II/III recommendations.” Nonetheless, both studies suggested that, while the responses demonstrated only moderate accuracy, ongoing advancements could support their potential implementation in clinical settings.
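The arithmetic behind these figures can be sketched as follows: the concordance rate is computed over all responses, whereas the insufficient and contradictory shares are computed over the nonconcordant responses only. The label names and toy ratings below are hypothetical and are not data from either study.

```python
from collections import Counter

# Hypothetical label set for expert ratings of each LLM response.
VALID_LABELS = {"concordant", "insufficient", "contradictory"}


def summarize_concordance(labels):
    """Summarize guideline concordance from one rating per response."""
    if not labels:
        raise ValueError("labels must not be empty")
    unknown = set(labels) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    counts = Counter(labels)
    total = len(labels)
    nonconcordant = total - counts["concordant"]

    summary = {
        "concordance_rate": counts["concordant"] / total,
        "insufficient_share": None,
        "contradictory_share": None,
    }
    if nonconcordant:
        # Shares are relative to nonconcordant responses, not to all responses.
        summary["insufficient_share"] = counts["insufficient"] / nonconcordant
        summary["contradictory_share"] = counts["contradictory"] / nonconcordant
    return summary


# Made-up ratings for ten responses (illustration only).
toy_labels = ["concordant"] * 6 + ["insufficient"] * 3 + ["contradictory"]
toy_summary = summarize_concordance(toy_labels)
```

Keeping the two denominators separate matters when reading the results: a high share of insufficient answers among nonconcordant responses says nothing about how often responses were nonconcordant in the first place.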
For general Q&A, there were mixed methodologies and conclusions regarding the accuracy of responses. Li et al. 25 conducted a comparative analysis of four LLMs, with three experienced researchers assessing accuracy on a custom 3-point scale: (1) Poor, for inaccurate, misleading responses; (2) Borderline, for minor factual errors unlikely to mislead; and (3) Good, for accurate, clear information. They found varying degrees of accuracy across the models, highlighting issues related to misinformation and the lack of depth in the responses. Özcan et al. 24 assessed the reliability (accuracy of information) of responses using a 7-point Likert-type scale ranging from (1) Completely unsafe, to (7) Absolutely reliable, 32 and concluded that the model performed relatively well, with high agreement between raters. However, the study examined only ChatGPT, the model Li et al. 25 identified as having the highest accuracy, and the authors did caution against the potential risk of misinformation depending on the type of question being asked. Lau et al. 27 used a more comprehensive evaluation framework (Safety, Consensus with Guidelines, Objectivity, Reproducibility, and Explainability; S.C.O.R.E.), 33 with each component rated on a 5-point Likert scale. We consider Safety and Consensus with Guidelines to be the components that evaluate accuracy in their work. They found that both GPT-4o and DeepSeek-V3 received high ratings on those two components (average ratings ranging from 4.2 to 4.49), with no statistical differences between models.
Finally, García-Rudolph et al. 30 provided a more objective measure of accuracy by evaluating ChatGPT’s ability to answer multiple-choice questions correctly. They found that both ChatGPT-3.5 and ChatGPT-4 exhibited high levels of accuracy, with scores of 84% and 96%, respectively.
Quality
The measures of quality varied across studies. Temel et al. 18 assessed responses using the EQIP (Ensuring Quality Information for Patients) score, 34 a validated instrument for assessing the quality of written health care information. The authors concluded that there were significant concerns regarding response quality (mean EQIP = 43.02%, “serious problems with quality”). Özcan et al. 24 used a 7-point Likert-type scale to assess the quality of responses through a usefulness scale ranging from (1) Not useful at all, to (7) Extremely useful. 32 Contrary to Temel et al., 18 the authors offered a generally positive assessment of response quality (i.e., usefulness), while cautioning against the potential risk of misinformation. Li et al. 25 assessed quality using a variety of validated tools, including the EQIP score, the textual coherence index (TCI; a measure of logical coherence and semantic fluency), the redundancy index (RI; a measure of textual redundancy), and the semantic similarity between question and answer based on the Bidirectional Encoder Representations from Transformers language model (BERT score; a measure of semantic relevance). In addition, they evaluated response comprehensiveness with a 5-point scale ranging from (1) Not comprehensive, to (5) Highly comprehensive. Similar to Özcan et al., 24 they found response quality to be relatively high across the LLMs, with a median comprehensiveness score of 4 for all four models. However, they highlighted variations in quality across models. 25 Finally, Lau et al. 27 found the responses from both GPT-4o and DeepSeek-V3 to be of high quality, based on their 5-point Likert scale scores for Objectivity, Reproducibility, and Explainability.
What questions about SCI are more challenging for LLM-based systems?
A subset of the included studies evaluated subgroups or categories of questions, finding that the readability, accuracy, and quality of responses vary across question types and across LLM systems.18,24,25,27–29
In the general Q&A task, Temel et al. 18 classified their phrases into categories based on the EQIP tool, finding that responses related to “Discharge or aftercare” questions had the highest reading complexity compared to responses related to the “Condition or illness” and “Miscellaneous” (a catch-all category for topics that do not fit the four categories in EQIP) categories. They found no statistical differences in the EQIP scores between categories, with the average EQIP suggesting concerns about the quality of the responses across all topics. Özcan et al. 24 classified their questions based on the three most searched themes from Google Trends. They found that SCI-related questions about “General information” had the lowest scores for accuracy (reliability) and quality (usefulness), while questions related to “Complications” had the highest scores for accuracy, and questions related to “Treatment” had the highest scores for quality. Interestingly, they found statistical differences for their measures across categories in only one of three raters. Li et al. 25 classified their questions into six groups: pathogenesis, risk factors, symptoms, diagnosis, treatment, and prognosis. In general, they found that the four LLM systems they evaluated produced “good” responses across all six topics in their consensus-based accuracy assessments. Across LLM systems, the questions most often rated “poor” were about “Clinical presentation”, “Diagnosis”, and “Treatment”. Lau et al. 27 grouped their questions into different categories but did not provide specific comparisons across groups or subgroup analyses.
For clinical Q&A, Saturno et al. 28 found that questions about “Treatment” were the most accurate (concordant) with clinical guidelines, followed by questions about “Diagnostic” and “Clinical assessment”, with the latter being the most nonconcordant set of questions. They also showed that for most nonconcordant questions, the LLM system failed to provide sufficient information, rather than giving contradictory information. Yu et al. 29 classified their questions into “Informative”, “Diagnostic”, and “Treatment”, with no differences in accuracy (concordance) among the groups, and all groups showed high levels of concordance. Similar to Saturno et al., 28 they found that most nonconcordant questions were due to insufficient information. They did not find statistical differences between ChatGPT and Gemini.
For medical education tasks, García-Rudolph et al. 30 did not provide group analysis for their questions; however, they did report the readability scores for the responses to each specific question. They found high variability in readability, with responses to some questions scoring as highly complex. Nonetheless, as noted above, on average, they found all responses to be poor in readability scores. They only provided aggregated summaries for response correctness in the multiple-choice task.
Reporting quality of included studies
To assess the reporting quality of the included studies, each article was classified according to its research design and LLM task, as identified by the TRIPOD-LLM guidelines, and evaluated for adherence to the most relevant sections (whether an item was present or absent). Figure 2 summarizes the results. None of the included studies mentioned the TRIPOD-LLM guidelines or any other reporting standard. This is expected, as only one of the included studies was published after the publication of the TRIPOD-LLM. Nonetheless, the TRIPOD-LLM items serve as a guide to evaluate reporting transparency in these articles. Of the 60 items, 16 did not apply to any of the included studies, and 16 were relevant only to a subset of studies. We refer the reader to the TRIPOD-LLM 17 guidelines for definitions and explanations of each item.

Figure 2. Summary of the TRIPOD-LLM checklist adherence evaluation.
Of the applicable items in the TRIPOD-LLM list, Title (1), Abstract (2a to 2l), and Introduction (3a to 4) were the sections with higher adherence across studies. In those sections, the item most often absent (absent in 4 out of 9 studies) was 2b “Provide a brief explanation of the health care context, use case and rationale for developing or evaluating the performance of an LLM”. We found that in all these cases, the studies did not describe the context or rationale for evaluating the LLM in the Abstract. Nonetheless, that information was available in the Introduction for most articles (present in 8 out of 9). The Methods section contained the largest number of items (5a to 15), with mixed adherence. In general, good adherence was found across included studies for items in section 5 “Data”, with the majority of studies describing the data used to generate questions. In contrast, in section 6 “Analytical Methods”, we found that the majority of studies did not specifically or clearly report on item 6d “Specify the initial and post-processed output of the LLM” (absent in 8 out of 9). Among the Methods items, others with low adherence were: item 9a “If research involved prompting LLMs, provide details on the processes used during prompt design, curation, and selection” (absent in 4 out of 9); item 12 “Report compute, or proxies thereof required to carry out methods” (absent in 7 out of 7); item 14e “Provide details of the availability of the study data” (absent in 4 out of 9); and item 14f “Provide details of the availability of the code to reproduce the study results” (absent in 7 out of 7). For the Results section, only item 17, “Report LLM performance according to pre-specified metrics and/or human evaluation” applied to all studies, with high adherence (present in 7 out of 9). Finally, although most items in the Discussion had high adherence, item 19e “Describe how poor quality or unavailable input data should be assessed and handled when implementing the LLM, i.e., what is the usability of the LLM in the context of current clinical care” was absent in 5 out of 8 studies. Item 19f “Specify whether users will be required to interact in the handling of the input data or use of the LLM, and what level of expertise is required of users” was absent in 8 out of 8 studies.
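The per-item counts above use a denominator that excludes studies for which an item was not applicable, which is why some items are reported out of 7 or 8 rather than 9. A minimal sketch of this aggregation is shown below; the item identifiers and assessments are toy values invented for illustration and do not reproduce the review's data.

```python
def item_adherence(assessments):
    """Aggregate present/absent counts per checklist item, excluding NA studies.

    `assessments` maps an item identifier to a list with one entry per study,
    each entry being "present", "absent", or "NA" (not applicable).
    """
    allowed = {"present", "absent", "NA"}
    summary = {}
    for item, values in assessments.items():
        invalid = set(values) - allowed
        if invalid:
            raise ValueError(f"invalid values for {item}: {sorted(invalid)}")
        applicable = [v for v in values if v != "NA"]
        present = sum(v == "present" for v in applicable)
        summary[item] = {
            "present": present,
            "absent": len(applicable) - present,
            "applicable": len(applicable),
        }
    return summary


# Toy assessments for nine hypothetical studies (not the review's actual data).
toy_assessments = {
    "item_x": ["present"] * 5 + ["absent"] * 4,  # applies to all nine studies
    "item_y": ["absent"] * 7 + ["NA"] * 2,       # applies to seven studies only
}
toy_summary = item_adherence(toy_assessments)
# toy_summary["item_y"] corresponds to "absent in 7 out of 7".
```

The same per-study, per-item structure is what a heatmap of presence, absence, and non-applicability would be drawn from.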
Discussion
The use of LLM-based systems, mostly conversational chatbots, for HI-related tasks (e.g., general public health Q&A, clinical recommendations) is increasing. The 2024 KFF Health Misinformation Tracking Poll: Artificial Intelligence and Health Information 35 found that about one in six adults say they use AI chatbots at least once a month to find HI, reaching one in four for adults under 30 years of age. Yet, among those using AI chatbots for HI, only 36% trust the chatbot to provide reliable HI. 35 In a recent survey of general practitioners in the UK, 25% reported using generative AI tools in clinical practice. 36 With the proliferation of these tools and their prevalent use in society for health-related questions, the health research community has naturally evaluated them for different tasks, including those related to HI in SCI. This is evident from the increase in published studies in the last year compared to 2024 and 2023, and there is no reason to think this proliferation of research articles will stop any time soon. Here, we synthesized the current literature on the evaluation of LLM-based systems’ capabilities to answer SCI-related questions. Based on our synthesis of the literature and observations, we identified risks, challenges, and opportunities for the use of LLM-based systems in SCI and summarized current gaps in research.
The promise and risks of LLMs for SCI HI
Overall, the synthesized studies agree on the potential utility of using LLM-based systems in HI tasks in SCI. For general Q&A use, the consensus is that most LLM chatbot responses are accurate and of acceptable quality across the evaluated systems. An exception is the work by Temel et al., 18 which reported low quality levels as measured by EQIP. Although studies are not comparable due to heterogeneity in design and methodology, Temel et al. 18 posed keyword phrases rather than questions to ChatGPT, which, lacking context, could explain lower-quality responses than in other studies. Although the overall result is promising, studies also report variability in the accuracy and quality of responses depending on the model used and, more importantly, on the type of question asked. This is a significant concern, since, in real-world situations, these tools do not provide measures of confidence or uncertainty around their responses. Thus, users are left to rely on their own knowledge and judgment to assess the reliability of the information, which is less than ideal for general Q&A, where we need to account for varying levels of health literacy. Another important concern for general Q&A is the low readability of responses, as reported in all studies that included a metric for it. The requirement of high levels of education to understand the responses increases the risk of misunderstanding, which poses a serious health risk. Fortunately, prompt instruction is an effective way to increase response accessibility, 37 as shown by Li et al. 25 and García-Rudolph et al. 30 in the SCI context.
The lack of readability in responses offers important considerations for individuals living with SCI and their caregivers. A common challenge and limitation that individuals with SCI and their caregivers encounter relates to health literacy and readability standards, which exacerbate accessibility challenges in knowledge mobilization. Agarwal et al. 38 conducted a readability assessment on 104 sections of educational materials from ten different websites focused on SCI and found that the language used required more advanced reading comprehension than that of the average American. 38 The National Institutes of Health (NIH) recommends that all HI provided to individuals and caregivers be written at a sixth- to eighth-grade reading level, to accommodate varying degrees of health literacy.39,40 Hence, ideally LLM-generated content should be clear, simple, and structured to meet the needs of the diverse SCI population. However, as mentioned above, our review of the literature reveals that content generated by LLMs is often difficult to read, on average requiring a comprehension level equivalent to approximately 14 to 15 years of formal education.18,25,29 This discrepancy highlights a barrier to accessibility and underscores the need for LLMs to improve readability and simplify complex medical information.
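To illustrate how a grade-level estimate of this kind can be obtained, the following minimal Python sketch computes the Flesch-Kincaid Grade Level of a text. Using this particular formula is an assumption made for illustration; the included studies may have relied on other readability indices. The syllable counter is a crude vowel-group heuristic, so validated readability tools should be preferred for actual evaluations.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per group of consecutive vowels,
    minus one for a likely silent trailing 'e'. Never returns less than 1."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Hypothetical example texts, not taken from any included study.
plain = "The cat sat. The dog ran."
dense = "Autonomic dysreflexia necessitates immediate pharmacological intervention."
print(round(flesch_kincaid_grade(plain), 2), round(flesch_kincaid_grade(dense), 2))
```

A score well above 8 would indicate that a response exceeds the sixth- to eighth-grade target discussed above, which could be used to flag responses for simplification.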
For clinical HI tasks such as medical education and clinical recommendations, the current evidence also suggests promising utility. Although the number of studies is limited to three, the general trend is that LLM chatbots provide accurate, concordant responses consistent with clinical guidelines and textbook references for most questions. Nonetheless, some of the assessed studies also provide evidence of heterogeneity in the accuracy of information depending on the nature of the question. In most cases, responses that fail evaluation for concordance with clinical guidelines do so by lacking essential information (e.g., an incomplete response) rather than by providing contradictory information.28,29 This is promising and suggests that training systems with more information and providing better context during prompting could help with this issue. Nonetheless, between 20% and 30% of the time, responses were not considered concordant with guidelines due to contradictory information, which poses a higher health risk. 28 A related issue was noted by Saturno et al., 28 who showed that 80% of responses to clinical recommendation questions were based on evidence levels II and III. This is an important finding that speaks to the quality of the data used to train general-purpose LLMs, and indicates the need to develop fine-tuned domain-specific models for SCI tasks using high-quality evidence.
Identified gaps in the literature, recommendations, and future research
Despite our comprehensive search of the existing literature, the use of LLMs for consultation and Q&A in SCI is understudied. As of December 2025, nine studies have been identified that assessed the use of LLMs in SCI, with high heterogeneity in study design, different evaluation domain definitions and a wide range of metrics, which limits our capacity to compare studies. We also found the quality of the reporting in the assessed studies variable, with good adherence to reporting guidelines in abstracts, introductions, and portions of the discussion. By contrast, our evaluation indicates a paucity of methodological reporting, in particular regarding prompting, response processing, and data availability. This undermines the reproducibility of published studies, challenges evidence synthesis, and highlights the need for greater transparency and reporting rigour in studies evaluating LLMs in SCI.
Most included studies focused exclusively on different versions of GPT models through ChatGPT. Currently, there is little evidence on the use of systems based on other LLMs, and a notable gap in research on HI using LLMs in SCI. Quantifying and characterizing the use of generic AI-powered chatbots and other tools is necessary to inform the impact and the development of SCI-specific tools. Moreover, most evaluated studies assess only one or two aspects of the LLM system’s responses. The use of comprehensive evaluation frameworks is lacking in general, with the exception of perhaps Li et al. 25 and Lau et al., 27 who used multi-domain evaluations for different LLMs. Although we did not perform a formal assessment of study quality, our consensus opinion is that Li et al. 25 is among the most rigorous, well-conducted, and transparently reported studies. Furthermore, most studies relied on subjective evaluations of accuracy and quality, often using non-validated ordinal scales. Validating some of these metrics and the incorporation of more objective measures will be important. One key aspect of objective metrics is their potential for automation, which could provide mechanisms for online real-time evaluation of responses while the user is interacting with the system and offer measures of confidence or quality of the response. Moreover, none of the included studies assessed model fairness, key to equitable AI systems. 41
In terms of identifying the questions to ask for evaluating LLMs, it is unclear whether Google Trends reflects only searches by individuals living with SCI, as assumed by all the included work, or also searches by students or other individuals seeking information on the topics of interest. We repeated the same Google Trends search and identified a seasonality pattern with peaks in April and November each year (Supplementary Fig. S1). The seasonality effect suggests regular events, which might be difficult to justify as searches by those with lived experience. Seasonal events related to SCI might be international scientific conferences and medical school examinations. To our knowledge, there is no research on how people with SCI specifically use Google searches for HI and how to disentangle their search trends from those of the broader population. In addition, only one study examined the perspectives of individuals with SCI/D through scenario-based interactions. Thus, there is a paucity of research that incorporates those who might benefit most from AI technologies for HI in SCI, such as individuals with lived experience. Collaborations among individuals living with SCI, caregivers, researchers, and health care professionals in the co-development of LLM tools will help ensure that these technologies are aligned with real-world needs and enhance their practical usability.
Identifying which questions and scenarios LLMs handle well or poorly will be important for delineating the limits of the current technology. Although the current literature provides an initial assessment, a thorough examination of the reasons why some questions are not answered adequately is lacking. This limitation affects accuracy and contextual relevance, as generic LLM tools lack access to high-quality, SCI-specific datasets to address the medical, rehabilitative, and psychosocial inquiries. 25 For example, the SCIRE project synthesizes research findings on SCI care practices to provide valuable information for health care professionals, scientists, policymakers, and individuals with SCI. 42 However, general LLMs often fail to integrate these specialized databases, thereby limiting their ability to provide precise and relevant SCI-related knowledge. Furthermore, the lack of real-time updates in training data further affects the reliability and relevance of LLM-generated outputs. 31 This can result in biased, outdated, or misleading information and abstract summaries that can negatively affect users, especially individuals living with SCI and their caregivers. Hence, tailoring LLM systems and integrating them with SCI-specific knowledge bases is necessary to ensure LLM outputs pertaining to SCI knowledge dissemination are accurate, contextually relevant, and well-referenced. Although identified as a need by some of the included studies, no publication as of this writing has evaluated and benchmarked general-purpose LLMs against SCI-augmented systems. To fully harness the benefits of AI and LLMs in SCI HI tasks, future research should focus on developing SCI-specific AI systems, ensuring that these systems are trained or augmented on domain-relevant datasets to improve accuracy and relevance.
SCI-specific systems could be developed by fine-tuning models or through retrieval-augmented generation (RAG) systems.43,44 This effort could be significantly assisted by benchmarking datasets in SCI for systematically developing and evaluating SCI-specific systems. It will also be important for LLM-based systems to provide user-friendly reports of the quality and credibility of the sources used for responses. This could help users gauge how much trust to put into specific pieces of information.
We also recommend that greater oversight be given to the factual accuracy and reliability of LLM-generated content, specifically for vulnerable populations such as individuals living with SCI. We strongly believe that policymakers and regulatory bodies should take action to protect vulnerable populations from mis- and disinformation. Perhaps groups such as the North American Spinal Cord Injury Consortium (NASCIC) could play a role in advocating and ensuring trustworthy AI development and use on behalf of the community. In addition, efforts should also be made to address health literacy gaps by tailoring LLM-generated content to appropriate readability levels, ensuring accessibility for individuals living with SCI and caregivers who have varying levels of education, health, and digital literacy. We also recommend the integration of standardized evaluation frameworks and readability metrics to assess the trustworthiness, quality, and usability of LLM-generated content.
An important consideration in evaluating LLMs is that aggregate accuracy metrics alone are insufficient for evaluating clinical and health advice, particularly when some errors may carry disproportionate risk for individual patients. The included studies that evaluated LLM responses’ accuracy did not consider the different risks associated with inaccurate responses. This should modulate the current generalized excitement in the literature about the use of LLMs for HI tasks. A chatbot that is “mostly correct” may still be very unsafe if the incorrect responses are severe, misleading, or actionable in harmful ways. Measuring risk and consequence in the advice produced by LLMs is at the limit of research in AI evaluation. The assessment of chatbot advice should move beyond raw correctness and accuracy toward risk- and consequence-aware evaluation. In practice, this might include stratifying errors by potential clinical severity; 45 identifying “catastrophic” failure or high-risk scenarios (e.g., unsafe triage, contraindicated recommendations) for a worst-case evaluation; and evaluating whether errors are recoverable through downstream safeguards such as clinician oversight or uncertainty disclosure. These will help ensure that model evaluations better reflect patient safety considerations rather than statistical performance alone. 46 From this perspective, a small number of high-risk errors may be more consequential than many low-risk inaccuracies, even if overall accuracy appears high.
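The contrast between raw accuracy and risk-aware evaluation can be sketched in a few lines of Python. The severity levels, their weights, and the example ratings below are all hypothetical and chosen only for illustration; they are not drawn from any included study or validated scale. The sketch reports accuracy alongside an error count per severity level, a severity-weighted harm score, and the worst-case error severity.

```python
from dataclasses import dataclass

# Hypothetical severity weights (assumption): higher means more potential harm.
SEVERITY_WEIGHTS = {"none": 0, "low": 1, "moderate": 3, "high": 10}


@dataclass
class RatedResponse:
    question_id: str
    correct: bool
    severity: str  # potential harm if the response were acted upon


def summarize(responses: list[RatedResponse]) -> dict:
    """Report raw accuracy together with risk-aware summaries of the errors."""
    n = len(responses)
    errors = [r for r in responses if not r.correct]
    errors_by_severity = {level: 0 for level in SEVERITY_WEIGHTS}
    for r in errors:
        errors_by_severity[r.severity] += 1
    weighted_harm = sum(SEVERITY_WEIGHTS[r.severity] for r in errors) / n
    worst_case = max(
        (r.severity for r in errors), key=SEVERITY_WEIGHTS.get, default="none"
    )
    return {
        "accuracy": (n - len(errors)) / n,
        "errors_by_severity": errors_by_severity,
        "weighted_harm": weighted_harm,
        "worst_case": worst_case,
    }


# Two hypothetical systems with identical accuracy but different risk profiles.
correct = [RatedResponse(f"q{i}", True, "none") for i in range(9)]
system_a = summarize(correct + [RatedResponse("q9", False, "high")])
system_b = summarize(correct + [RatedResponse("q9", False, "low")])
print(system_a)
print(system_b)
```

In this toy comparison both systems are correct on nine of ten questions, yet the single high-severity error gives the first system a much larger weighted harm score, which is the distinction that aggregate accuracy hides.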
Furthermore, we note other methodological shortcomings identified in the evaluated research that might undermine the work’s robustness and that offer opportunities to improve research quality. Most of the included studies posed questions only once, with no follow-up and no further contextual information in their prompts. This brings two major limitations. First, since LLM responses are probabilistic, repeated independent assessment per question is important to assess intra-question variance across metrics and models. Lau et al. 27 posed each question three times independently. This can provide a measure of response repeatability and limit the effects of good or bad responses by chance. Second, it is unlikely that most people interact with LLM chatbots without providing context (e.g., explaining their health condition) or without rephrasing the question in several ways and asking for clarifications. Thus, most evaluated studies may not reflect real-world interactions, risking the generalizability and transportability of the results. Moreover, LLMs are sensitive to paraphrasing, which is not studied in the current SCI literature. Importantly, a portion of the included studies did not statistically assess their claims, yet made inferential interpretations from their evaluations. Moving forward, robust study designs, evaluations, and analyses will be required. Finally, studies should benchmark multiple LLM-based systems using a battery of well-defined concepts and validated indicators. The inclusion and more robust evaluation of open-source models such as Llama are imperative, and more transparent reporting is necessary to increase reproducibility.
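As a sketch of how repeated independent runs could be summarized, the Python example below computes the mean, sample standard deviation, and range of scores for each question. The question labels, the 1 to 5 rating scale, and the scores are hypothetical; the choice of three repetitions simply mirrors the design reported by Lau et al. and is not a recommended minimum.

```python
from statistics import mean, stdev


def repeatability_summary(
    scores_by_question: dict[str, list[float]],
) -> dict[str, dict[str, float]]:
    """Summarize intra-question variability across repeated independent runs."""
    summary = {}
    for question, scores in scores_by_question.items():
        if len(scores) < 2:
            raise ValueError(f"{question!r} needs at least two repeated scores")
        summary[question] = {
            "mean": mean(scores),
            "sd": stdev(scores),  # sample standard deviation across repetitions
            "range": max(scores) - min(scores),
        }
    return summary


# Hypothetical accuracy ratings (1 to 5 scale) from three independent runs.
ratings = {
    "bladder_management": [4, 4, 4],   # stable across runs
    "pressure_injury_care": [2, 5, 3],  # unstable across runs
}
for question, stats in repeatability_summary(ratings).items():
    print(question, stats)
```

A question with a large spread, such as the second hypothetical item, would signal that a single run could over- or understate performance by chance.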
Limitations
The limitations of this study include the small number and heterogeneity of studies conducted, which limit our synthesis. Additionally, although we noted some methodological shortcomings in the current literature and provided recommendations to improve methodological rigour in future studies, we did not formally assess the quality of the research. To our knowledge, there is no validated tool for risk of bias and quality assessment in studies evaluating LLMs. Nonetheless, we used the TRIPOD-LLM checklist as a surrogate for transparency and quality in reporting, pointing to potential issues in reproducibility in the included studies. Finally, we acknowledge that the topic of research is not mature and is quickly changing, which makes it challenging to compare research studies even in the last two years.
Conclusions
This paper highlights the transformative potential of LLMs in SCI. However, as we highlighted, current challenges related to factual accuracy, biases, and readability make it difficult to fully integrate LLMs into SCI HI tasks. Despite these limitations, ongoing advancements in AI technologies, coupled with the development of SCI-specific curated document databases and robust evaluation frameworks, offer promising solutions to address these challenges. There is a pressing need to increase both the quantity and quality of research in this area. Ensuring accessible, high-quality, and personalized information for clinicians, individuals living with SCI, caregivers, health system managers, and policy makers is essential to improving health outcomes and quality of life.
Transparency, Rigor and Reproducibility Statement
This narrative review was conducted with attention to transparency, rigor and reproducibility in the selection and synthesis of articles included in the review, and the interpretation of results. The review question, scope and inclusion criteria were defined in advance. Search strategies and databases reviewed are reported in the Methods section, to clarify how literature was identified. Studies were appraised for their relevance, methodology and potential bias. Regarding reproducibility, although narrative reviews are often interpretive in nature, a full list of included studies is provided (Table 1). The steps for literature identification and selection are outlined to support reproducibility of process and reasoning. A full list of references is provided to ensure clarity and to facilitate future review updates.
Authors’ Contributions
The authors contributed as follows: R.K.D.: investigation, writing original draft; S.S.: investigation, writing review and editing; M.P.: writing original draft; K.L.N.: writing review and editing; P.R.Y.: writing review and editing; O.D.: writing review and editing; M.M.R.: writing review and editing; N.F.: writing review and editing; J.C.: review and editing; V.K.N.: writing review and editing; A.T.E.: conceptualization, methodology, visualization, investigation, writing review and editing, funding acquisition, supervision.
Footnotes
Author Disclosure Statement
The authors have no competing interests to disclose.
Funding Information
This project received support from a Craig H. Neilsen Foundation Strategic Grant (UW2025-2028).
