Abstract
Study Design
Comparative Analysis and Narrative Review.
Objective
To assess and compare ChatGPT’s responses to the clinical questions and recommendations proposed by The 2011 North American Spine Society (NASS) Clinical Guideline for the Diagnosis and Treatment of Degenerative Lumbar Spinal Stenosis (LSS). We explore the advantages and disadvantages of ChatGPT’s responses through an updated literature review on spinal stenosis.
Methods
We prompted ChatGPT with questions from the NASS Evidence-based Clinical Guidelines for LSS and compared its generated responses with the recommendations provided by the guidelines. A review of the literature was performed via PubMed, Ovid, and Cochrane on the diagnosis and treatment of lumbar spinal stenosis between January 2012 and April 2023.
Results
Fourteen questions proposed by the NASS guidelines for LSS were entered into ChatGPT, and its responses were directly compared with the recommendations offered by NASS. Three questions were on the definition and history of LSS, one on diagnostic tests, seven on non-surgical interventions, and three on surgical interventions. The literature review identified 40 articles for inclusion that helped corroborate or contradict the responses generated by ChatGPT.
Conclusions
ChatGPT’s responses were similar to findings in the current literature on LSS. These results demonstrate the potential for implementing ChatGPT into the spine surgeon’s workplace as a means of supporting the decision-making process for LSS diagnosis and treatment. However, our narrative summary provides only a limited literature review, and additional research is needed to standardize our findings as a means of validating ChatGPT’s use in the clinical space.
Introduction
Spine surgeons are tasked with making polychotomous decisions every day when it comes to providing effective patient care. As spine research and technology continue to advance, the role of guidelines has become prominent in the surgeon’s decision-making process.1 Degenerative lumbar spinal stenosis is one such spinal condition for which accepted practices in diagnosis and treatment continue to advance.2 In 2011, The North American Spine Society (NASS) Clinical Guideline for the Diagnosis and Treatment of Degenerative Lumbar Spinal Stenosis attempted to alleviate differences in practices by evaluating the highest quality literature and formulating evidence-based recommendations through expert consensus and analysis.3 However, as new research continues to emerge and outpace the development of updated guidelines, a gap has formed between what is recommended and what is possible with the latest advancements.
Recently, Chat Generative Pre-trained Transformer (ChatGPT) has emerged as a potential resource to address these challenges. ChatGPT is an artificial intelligence (AI) language model that utilizes deep learning to generate responses to human prompts on command. The latest version of ChatGPT (GPT-3.5) was trained on 45 terabytes of text data, including news articles, books, journals, and other sources that allow it to readily respond to a wide array of queries. Thus, the advent of ChatGPT has led to interest in its ability to synthesize medical knowledge and make clinical decisions.4-6
The purpose of this paper is to analyze and compare ChatGPT’s recommendations to the clinical questions and recommendations proposed by the NASS Clinical Guideline for the Diagnosis and Treatment of Degenerative Lumbar Spinal Stenosis. Our study aims to qualitatively evaluate ChatGPT’s performance when responding to clinical questions related to lumbar spinal stenosis and provide a review of the literature to corroborate or refute its findings.
Methods
Sixteen questions were obtained from the most recent NASS Clinical Guideline for the Diagnosis and Treatment of Degenerative Lumbar Spinal Stenosis, revised in 2011. All questions were screened and then input into ChatGPT with no modifications. Two questions from the guidelines were excluded since the NASS guidelines did not report an answer or any supporting evidence for them.
For the 14 included questions, a new ChatGPT window was created for each question in order to avoid any bias from prior questions. After ChatGPT generated a response, it was recorded verbatim in our database. The answers from ChatGPT were then compared to those presented by the guidelines and were assessed for accuracy against what the guidelines had presented.
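The recording step described above can be pictured as one record per question, holding the guideline recommendation and the verbatim ChatGPT response side by side. The following is a minimal sketch only: the `QuestionRecord` schema, its field names, the `split_by_evidence` helper, and the grade labels are hypothetical illustrations and do not describe the database actually used in this study. Placeholder strings stand in for the real question and response text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class QuestionRecord:
    """One guideline question with both answers stored side by side (hypothetical schema)."""
    category: str                        # e.g. "Surgical Interventions"
    question: str                        # guideline question, entered without modification
    nass_recommendation: Optional[str]   # None when the guideline reported insufficient evidence
    chatgpt_response: str                # recorded verbatim from a fresh ChatGPT window
    grade: Optional[str] = None          # reviewer label, e.g. "precise" or "imprecise"

    @property
    def has_guideline_recommendation(self) -> bool:
        return self.nass_recommendation is not None


def split_by_evidence(
    records: List[QuestionRecord],
) -> Tuple[List[QuestionRecord], List[QuestionRecord]]:
    """Separate questions the guideline answered from those it left open."""
    answered = [r for r in records if r.has_guideline_recommendation]
    unanswered = [r for r in records if not r.has_guideline_recommendation]
    return answered, unanswered


# Placeholder content only; no real question or response text is reproduced here.
records = [
    QuestionRecord("Surgical Interventions", "<question text>",
                   "<guideline recommendation>", "<ChatGPT response>", "imprecise"),
    QuestionRecord("Manipulation", "<question text>", None, "<ChatGPT response>"),
]
answered, unanswered = split_by_evidence(records)
print(len(answered), len(unanswered))
```

Keeping the guideline recommendation optional mirrors the two groups of questions analyzed in this study: those with a NASS recommendation and those for which the guidelines reported insufficient evidence.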
For our narrative review, we searched the PubMed, Ovid, and Cochrane research databases between January 2012 and April 2023 for articles about diagnosis and treatment of lumbar spinal stenosis that answered the questions proposed by the NASS guidelines. Our search strategy included key words such as lumbar spinal stenosis or spinal stenosis, surgery, decompression, fusion, medication, opioids, epidural injections, steroids, physical therapy, exercise, manipulation, bracing, diagnostic test, history, and clinical guidelines. Eligible studies (1) evaluated adult patients (≥18 years) who had undergone an operative or non-operative intervention for LSS; (2) reported postoperative outcomes that were related to the guidelines’ questions; and (3) included data that analyzed changes in function or outcomes compared to baseline. All studies before 2012 were excluded since these findings were already incorporated into the 2011 NASS guidelines. We excluded animal studies, case reports, and articles not available in English or that did not provide data related to the questions posed by the guidelines. Two reviewers independently screened the titles and abstracts of the 4487 citations retrieved from the initial search based on our inclusion criteria. Afterwards, two reviewers independently screened the included full-text articles. Conflicts between the reviewers were resolved by a third investigator. After performing full-text screening, 40 studies were considered eligible.
Informed consent was not required for this project, and there was no need for Institutional Review Board approval as no patient data was involved.
Guideline Questions Characteristics
NASS Guideline recommendations for Lumbar Stenosis vs ChatGPT. Questions and recommendations from the NASS guidelines recorded for responses that had sufficient evidence to make a claim. Responses from ChatGPT were also recorded and graded for precision.
Insufficient data to make conclusions based on NASS guidelines vs ChatGPT recommendations. Questions and recommendations from the NASS guidelines recorded for responses that had insufficient evidence to make a claim. Responses from ChatGPT were also recorded and graded for precision.
Assessment of ChatGPT’s Performance
Our study revealed that ChatGPT has the potential to provide accurate recommendations for interventions and treatments in the management of lumbar stenosis. Our findings can be categorized into two themes: (1) The precision of ChatGPT in providing recommendations compared with the NASS guidelines, and (2) The capability of ChatGPT to provide accurate recommendations in the absence of evidence presented within the NASS guidelines.
ChatGPT vs Guidelines Recommendations
Definition and History
The NASS guidelines recommended answers for two of the three questions that fell under the category of definitions and natural history of the condition. When comparing the responses to the question regarding the working definition of lumbar stenosis, ChatGPT was able to generate a definition that closely resembled what the guidelines presented, including the correct anatomical aberrations and associated symptoms. However, when asked about the natural history of lumbar spinal stenosis, ChatGPT omitted some of the points presented in the guidelines. For instance, ChatGPT did not mention how often the condition follows a favorable course, specifically that the natural history of mild to moderate stenosis can be favorable in one-third to one-half of patients. These findings have been recorded across numerous studies in the literature throughout the past decade.7-10 Furthermore, the guidelines mentioned the rarity of rapid neurological impairment following a lumbar stenosis diagnosis, a point that ChatGPT did not comment on. There have only been a handful of case reports of neurological decline in patients with spinal stenosis, suggesting that it is indeed an atypical finding.11
Diagnostic Tests
Within the NASS guidelines, there was only one question that assessed the role of appropriate diagnostic tests for lumbar stenosis. We found that ChatGPT gave a comprehensive response that adequately matched many major themes highlighted by the guidelines. When discussing imaging tests that can be implemented, ChatGPT accurately presented the utilization of X-rays, MRIs, and CT scans in the diagnosis of spinal stenosis. Several studies in the current literature have discussed the diagnostic benefits of using each of the aforementioned tests.12-15 ChatGPT also accurately presented information on the importance of nerve function tests that were highlighted by the guidelines, including electromyography and nerve conduction studies. These tests, highlighted by both the guidelines and ChatGPT, have been evaluated across multiple papers in the last few years and have been found to be accurate diagnostic tools.16-18 It should be noted that although ChatGPT did not rank the different tests in terms of their efficacy as the guidelines did, it communicated the importance of consulting a healthcare professional to help decide on the most appropriate diagnostic test.
Non-Surgical Interventions
The NASS guidelines had adequate evidence to provide recommendations for only three of the seven questions related to non-surgical interventions in the treatment of lumbar stenosis. Two of the questions assessed the role of epidural steroid injections (ESI), while the third question explored the long-term results of medical management of stenosis. When assessing the role of contrast-enhanced, fluoroscopic guidance for the injection of epidural steroids, both the guidelines and ChatGPT agreed that it is recommended in order to improve the accuracy of medication delivery. Although there are only a couple of studies in the literature that specifically assess the effectiveness of fluoroscopically guided epidural steroid injections in lumbar stenosis, they all reached the same consensus as highlighted by the guidelines.19,20
The second question explored the role of ESI in the management of spinal stenosis, and our results showed that ChatGPT was able to accurately comment on the location and the benefits of the injection. A couple of randomized trials have assessed the efficacy of ESI in spinal stenosis and found that a slight benefit was detected in patients in the short term.21,22 Interestingly, both the guidelines and ChatGPT emphasized that most ESIs were only analyzed in short-term studies and that there was not enough evidence to comment on the long-term effects, suggesting ChatGPT is capable of adapting its recommendations to include short- and long-term analyses of interventions. The results from the final question found that ChatGPT was able to accurately conclude that the medical management of stenosis may improve long-term patient outcomes, but that this depends on the individual case. These results further suggest that ChatGPT is capable of recognizing that broad questions like these cannot be answered in a general sense and that more information about the individual patient is needed in order to reach a definitive conclusion.
Surgical Interventions
Within the NASS guidelines, all three questions pertaining to surgical interventions for spinal stenosis were provided with recommendations based on evidence-based literature. One of the most common procedures performed is spinal decompression, which the guidelines recommend as the intervention for patients with at least moderate symptoms of lumbar spinal stenosis. Within the literature, a multitude of systematic reviews and meta-analyses from the past five years have associated spinal decompression with favorable outcomes for patients with spinal stenosis.23-26 ChatGPT provided a very similar recommendation to the guidelines and, importantly, took into account that this procedure should be assessed on a patient-to-patient basis. These results suggest that if given adequate information, ChatGPT may be capable of recommending even more specific interventions based on the condition of the patient. However, ChatGPT’s answer in this case was graded as imprecise because its response cited false evidence from studies that do not exist in the literature.
The second question presented in the NASS guidelines explored whether the addition of fusion with spinal decompression improves patient outcomes. Both the guidelines and ChatGPT highlighted that decompression alone is the superior approach, especially for patients who don't present with instability. These findings have been recently corroborated in the literature by multiple studies.25,27,28 Once again, ChatGPT presented a conservative recommendation to this question and emphasized that only patients that present with instability tend to get this operation. However, it once again presented within its response fictitious papers that could not be found in the literature.
Finally, the NASS guidelines investigated the long-term results of surgical management of stenosis and concluded that there had been significant improvements in a large percentage of patients more than 4 years postoperatively. They also recommend that surgical decompression be considered in patients who are 75 years or older. A handful of long-term studies that addressed this specific question were found to be in favor of the recommendations from the guidelines.29-31 ChatGPT was also able to reach the same conclusions but, once again, included in its response fabricated evidence from false articles. In all, it is important to note that ChatGPT can be used to assess the benefits and risks associated with different surgical interventions in lumbar spinal stenosis, but may fall short in providing accurate evidence-based literature within its responses.
ChatGPT Recommendations vs Insufficient Data in Guidelines
Due to insufficient evidence at the time of writing, the NASS guidelines did not provide recommendations for five questions, including physical exercise and therapy, manipulation, pharmacological treatment, ancillary treatments, and physical findings that are associated with lumbar spinal stenosis. However, ChatGPT provided recommendations that were cross-referenced with the contemporary literature.
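As a quick consistency check, the per-category question counts reported in the sections above can be tallied against the totals stated in the Methods. The sketch below uses only numbers given in this paper; the variable names are illustrative.

```python
# Per category: (total questions, questions for which NASS provided a recommendation),
# as reported in the category sections of this paper.
question_counts = {
    "Definition and History": (3, 2),
    "Diagnostic Tests": (1, 1),
    "Non-Surgical Interventions": (7, 3),
    "Surgical Interventions": (3, 3),
}

total_questions = sum(total for total, _ in question_counts.values())
with_recommendation = sum(rec for _, rec in question_counts.values())
without_recommendation = total_questions - with_recommendation

print(total_questions, with_recommendation, without_recommendation)  # 14 9 5
```

The tally gives 14 included questions, of which 9 had a guideline recommendation and 5 did not, which matches the five questions discussed in this section.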
Physical Exercise and Therapy
When assessing the role of physical therapy and exercise in the treatment of spinal stenosis, the NASS guidelines did not make a specific recommendation about its utilization and broadly stated that any active therapy can be considered an option. On the other hand, ChatGPT provided specific exercises that could be implemented for patients, including strengthening, stretching, and aerobic exercises. The conclusions reached by ChatGPT have been studied in the recent literature. A randomized controlled trial (RCT) of 104 patients with spinal stenosis found that 82% of participants in a 6-week structured comprehensive training program, which consisted of 18 exercises and stretches targeted for the lower back, reached a minimal clinically important difference (MCID) in mean walking capacity compared to 63% in the self-directed program.32 Another RCT that analyzed 86 stenosis patients found supervised physical therapy that included stretching and strengthening exercises, cycling, and treadmill walking to be significantly associated (P = .01) with higher MCIDs compared to a home-exercise group.33 The findings from these studies match the recommendations generated by ChatGPT, suggesting that it is capable of providing accurate recommendations to clinicians when it comes to exercise and therapy for lumbar stenosis.
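To put the responder rates from the first trial in perspective, the gap between the two arms can be expressed as an absolute difference and as a ratio. This is a minimal arithmetic sketch using only the two percentages quoted above (82% and 63%); the function name is illustrative, and because the per-arm sample sizes are not given here, no confidence interval or significance test is attempted.

```python
def compare_proportions(p_intervention: float, p_comparator: float):
    """Return the absolute difference and the ratio of two responder proportions."""
    absolute_difference = p_intervention - p_comparator
    ratio = p_intervention / p_comparator
    return absolute_difference, ratio


# Proportion reaching the MCID in walking capacity: structured vs self-directed program.
difference, ratio = compare_proportions(0.82, 0.63)
print(f"{difference * 100:.0f} percentage points, ratio {ratio:.2f}")
# 19 percentage points, ratio 1.30
```

In other words, participants in the structured program reached the MCID about 1.3 times as often as those in the self-directed program, an absolute gap of 19 percentage points.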
Manipulation
With regard to manipulation techniques, which involve manual therapy usually performed by chiropractors or physical therapists, the NASS guidelines abstained from recommending either for or against this approach to the management of spinal stenosis due to a lack of evidence. However, ChatGPT did make certain recommendations regarding the impact that manipulation may have on spinal stenosis, taking into consideration the controversies behind some of the practices. Multiple studies have analyzed the effects of manipulation techniques in stenosis patients, with varying results. For instance, Choi et al 34 found that flexion-distraction spinal manipulation was effective at reducing pain in patients with spinal stenosis to a greater degree compared to those treated with conservative physical therapy. This study aligns with the statement given by ChatGPT with regard to pain outcomes when using manipulation in spinal stenosis patients. Another study by Oh et al 35 reported that flexion-distraction reduces pain to a greater extent than conservative physical therapy, indicating that manual manipulation therapy is an effective intervention for pain management. Furthermore, Smith et al 36 concluded that flexion-distraction manipulation improves both pain and function in LSS patients, recommending its utilization in the clinical setting. The results from these studies are in agreement with ChatGPT’s recommendations of manipulation with regard to function, since manipulation therapy was found to improve function in patients burdened with spinal stenosis.
Other work has assessed the quality of the existing literature with regard to the role of manipulation in LSS and concluded that there is “moderate-quality evidence” supporting the positive role that manipulation has in managing LSS.37 This systematic review identified a single randomized clinical trial testing the effects of manipulation on LSS, and that trial found no significant improvements in motion or function following manipulation in patients with LSS.38 In accordance with the output from ChatGPT, the literature does have conflicting evidence regarding the benefits of manipulation. Thus, ChatGPT gave an accurate representation of the current body of literature regarding manipulation and outlined documented benefits in terms of pain and function, while still being conservative and acknowledging the conflicting body of evidence.
Pharmacological Treatment
The NASS guidelines have made no recommendations for or against pharmacological treatment for LSS. However, ChatGPT did generate an output regarding pharmacological treatment and denoted it as “useful” in the treatment of spinal stenosis. Specifically, ChatGPT gave a broad pain control recommendation including both NSAIDs and acetaminophen, both of which have been studied in the literature. Chou et al 39 conducted a systematic review on the effectiveness of pharmacological treatments for the treatment of lower back pain and found that acetaminophen was ineffective for chronic lower back pain (CLBP), a condition with symptomatology similar to LSS. Another systematic review by Enthoven et al 40 identified weak evidence that favors NSAID use for the long-term management of CLBP, concluding that no firm recommendation can be made about their implementation in the clinic. Despite the weak evidence in the literature regarding NSAID or acetaminophen use in the treatment of CLBP or LSS, ChatGPT still deemed these pharmacological treatments as useful in the management of LSS symptomatology.
Additionally, ChatGPT’s recommendation failed to incorporate new evidence with regard to pharmacological treatments that have historically been used for neuropathic pain. One medication of this nature is gabapentin, which is typically used in diabetes-associated neuropathic pain, although some studies have explored its effects in the management of LSS. A study by Haddadi et al reported a decreased number of patients with paraesthesia and claudication 8 weeks after starting gabapentin. Furthermore, a systematic review by Ammendolia et al 41 found that gabapentin was effective in reducing pain and improving walking distance. Other pharmacological treatments that have been documented in the literature include epidural corticosteroids, which have been tested for their effectiveness in LSS management. A recent randomized clinical trial by Friedly et al 22 found no benefit of epidural corticosteroids in terms of disability and pain over a period of 12 months in patients with LSS. A more rigorous randomized trial by Friedly et al 21 in 2014 similarly concluded that over 12 months epidural corticosteroid injections show no significant benefit in terms of disability and pain associated with LSS. Other medications like duloxetine have been tested in patients with CLBP, which has symptomatology similar to that of LSS. Duloxetine is thought to promote analgesia by increasing the neurotransmitters involved in the descending pain inhibitory pathway. A study by Konno et al 42 reported improvements in CLBP with duloxetine at week 14 compared to placebo (P = .0026), as well as improvements in impression of severity (P = .0019) and disability (P = .0439). ChatGPT failed to present the available information on gabapentin, epidural corticosteroids, and duloxetine, all of which have been investigated in the literature for their potential role in the management of LSS.
In terms of pharmacological recommendations for LSS, ChatGPT inappropriately recommended NSAIDs and acetaminophen for LSS and did not include other pain medications for neuropathic pain widely explored in the body of literature for their potential benefits.
Ancillary Treatments
With regard to ancillary treatments, the NASS guidelines did not make a definitive recommendation about treatments such as bracing, electrical stimulation, or acupuncture. ChatGPT commented on the role bracing can play in reducing pressure on the thecal sac and nerve roots, while also providing support to the lower back. Very limited literature has specifically assessed the role of bracing in the context of stenosis. One study by Ammendolia et al 43 developed a prototype lumbar stenosis belt and conducted a two-arm, double-blinded RCT that assessed walking distance in 104 patients who wore either the prototype or a standard lumbar support (Tensor 3M Canada). The authors concluded that both the prototype and the standard lumbar support significantly improved the walking ability of patients with lumbar stenosis, but there was no significant difference between the two groups. Based on the improved outcomes of patients in this study, ChatGPT’s response was deemed accurate when it comes to bracing recommendations.
ChatGPT further commented on the role of electrical stimulation, specifically Transcutaneous Electrical Nerve Stimulation (TENS), recommending its utilization in either the hospital or home setting for pain reduction and muscle strengthening, although there is also very limited literature on this topic. A recent RCT conducted by Harmsen et al 44 assessed the role of neuromuscular electric stimulation in patients presenting with leg cramps secondary to their diagnosis of lumbar spinal stenosis. Thirty-two participants were grouped into four cohorts that received either 85%, 55%, 25%, or 0% of their maximum tolerated stimulation intensity, and the first three groups reported a significantly reduced number of leg cramps. These findings suggest that even at low stimulation levels, the application of electrical stimulation can alleviate pain in patients with lumbar stenosis. Another RCT by Ammendolia et al 45 explored how TENS can impact neurogenic claudication and walking ability by randomizing 104 patients either to an active (n = 49) or detuned TENS (n = 51) group. The results revealed that 71% of the active TENS and 74% of the detuned TENS participants achieved a 30% improvement in walking distance, leading the authors to recommend against the utilization of TENS in clinical practice. Because ChatGPT did not address this study in its statement, and because there is insufficient evidence to support these recommendations for ancillary treatments, this response was marked as imprecise.
Furthermore, the NASS guidelines had commented on the role of acupuncture but ChatGPT did not mention this intervention. However, based on the most recent literature search, it is plausible to assume that most of ChatGPT’s recommendations are backed by limited evidence-based studies and can serve as a useful tool when it comes to assessing ancillary treatments.
Physical Findings Associated With LSS
The NASS guidelines did not provide a recommendation for one of the three questions that fell under the category of definitions and natural history of the condition. In this case, the guidelines did not have sufficient evidence to make any recommendations related to the history and physical findings associated with a diagnosis of lumbar spinal stenosis, other than that older patients who present with a history of lower extremity pain exacerbated by walking or standing can be characterized as having spinal stenosis. ChatGPT did not categorize patients by demographic factors such as age in its response when assessing the physical findings, making its conclusion less accurate.
The NASS guidelines also commented on the utilization of self-administered questionnaires in the diagnosis of spinal stenosis. However, ChatGPT did not mention qualitative measures such as questionnaires within its response, further decreasing its precision compared to the guidelines. More recent studies have been published on the benefits of such qualitative measures. For instance, a study by Tominaga et al 46 assessed the effectiveness of two measures, the LSS-diagnosis support tool (LSS-DST) and the LSS-self-administered self-reported history questionnaire (LSS-SSHQ), and found that both tools were more sensitive in screening patients for spinal stenosis than traditional measures presented by the NASS guidelines. Another recent study by Wilartatsmai evaluated the reliability of the Swiss Spinal Stenosis (SSS) questionnaire in 107 patients and found it to be a strongly valid and reliable tool in the evaluation of physical function and symptom severity in patients with spinal stenosis. 47 Taken together, the conclusions from these studies support the implementation of qualitative measures in the diagnosis of stenosis. The remainder of ChatGPT’s response provided general physical findings that have been found to be associated with spinal stenosis, while noting that it is best to obtain a thorough history, physical examination, and diagnostic imaging to make an accurate decision.
Future Directions
ChatGPT is a general-purpose generative AI model that is trained to recognize patterns in text and generate predictions when it is presented with a prompt; it is not fine-tuned on medical databases. However, with the rise of new generative AI models that are fine-tuned on medical databases, it may be possible to generate stronger results than what our study has shown. Future research is needed to assess these new models to provide clinicians with the most recent and productive tools that can help make decisions in the field.
Limitations
A limitation of this study is that the NASS Evidence-Based Clinical Guidelines for Lumbar Spinal Stenosis have not been updated since 2011, while ChatGPT was trained on data up to 2021. Thus, it is plausible that some of the recommendations presented by the guidelines are outdated. However, we conducted a contemporary literature search for the most recent evidence-based studies as a means of assessing ChatGPT’s ability to reach conclusions based on new data. Another limitation is that some questions presented by the guidelines were too broad, leading ChatGPT to generate responses that were generally non-specific. If the questions are supplemented with more specific details, ChatGPT may better serve clinicians with more accurate responses, although further work is needed to prove this. A third limitation is that ChatGPT has been trained to generate definitive responses to direct questions, even when it would be better to state that there is not enough evidence to make a recommendation. This phenomenon could be costly to patients and is related to what has been labeled “artificial hallucination”, in which ChatGPT cites references that do not exist in support of its argument. Additionally, ChatGPT may have been trained on the NASS guidelines during its initial development, possibly introducing a response bias to the specific questions it was presented with in our study. Finally, clinically fine-tuned models such as Med-PaLM 2 are not publicly available, which is why ChatGPT was utilized in this study.
Conclusions
The implementation of AI models in the medical setting is becoming a promising tool that can help physicians make better decisions when it comes to patient care. Our study suggests that ChatGPT may accurately answer questions and provide evidence about the diagnosis and treatment options for lumbar spinal stenosis that is in line with the currently available evidence. However, it has been shown to fabricate evidence from time to time when it comes to specific questions pertaining to spinal stenosis. Thus, we recommend that patients and clinicians utilize ChatGPT with caution. Further study is required to understand the full impact ChatGPT can have in the clinical setting for spine surgery and beyond.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
