Abstract
Study Design:
Systematic review.
Objectives:
The purpose of this study is to review outcomes reporting methodology in studies evaluating fusion for lumbar spinal stenosis.
Methods:
A systematic review of PubMed and Embase databases was conducted from January 2007 to June 2017 for English language studies with a minimum of 2 years of postoperative follow-up reporting outcomes after fusion for lumbar spinal stenosis. Two reviewers assessed each study; those meeting inclusion criteria were examined for pertinent data. Outcome measures were categorized into relevant domains: pain/symptomatology, function/disability, and surgical satisfaction. Return to work reporting was also recorded.
Results:
Of 123 studies meeting inclusion criteria, 76% included posterior-only fusion, 32% included posterior/transforaminal interbody fusion, and 5% included anterior/lateral interbody fusion (non-mutually exclusive). There was significant variation in patient-reported outcomes (PROs) used—studies reported 31 unique PROs assessing at least one domain: 22 evaluating pain, 23 evaluating function, and 3 evaluating surgical satisfaction. Most commonly utilized PROs were the Oswestry Disability Index (73% of studies), Visual Analog Scale (55%), and 36-Item Short Form Survey (32%). The remaining 28 measures were used in 14% of studies or fewer. PROs specific to symptoms of lumbar spinal stenosis, such as the Zurich Claudication Questionnaire, were only used rarely (7/123 studies). Only 14% of studies reported on time to return to work.
Conclusions:
The literature surrounding fusion in the setting of lumbar stenosis is characterized by substantial variability in outcomes reporting. Very few studies utilized measures specific to lumbar spinal stenosis. Efforts to standardize outcomes reporting would facilitate comparisons of surgical interventions.
Introduction
The number of patients with lumbar spinal stenosis (LSS) treated with fusion has been rising rapidly. A recent epidemiological study of the Nationwide Inpatient Sample found that the rate of decompression and fusion for a diagnosis of LSS rose from 21.5% to 31.5% between 2004 and 2009, while the rate of decompression alone dropped from 58.5% to 49.2%. 1 While surgery has been shown to be superior to conservative management in many studies, 2-4 other investigations have not been as definitive in their results. 5,6 The ability to measure the effectiveness of surgery is likely hampered by the wide variety of surgical techniques used to treat LSS and the assortment of outcome measures used to define results. 7 A 2016 Cochrane review was unable to definitively conclude that surgery was superior, due to the heterogeneity of interventions and lack of standardized outcome measures. 8
The future of health care delivery is based on value, a measure of health outcomes achieved per dollar spent. 9 Value will dictate both whether surgery for LSS is superior to conservative management and whether one surgical intervention is superior to another. Thus, the accurate determination of value must be based on standardized outcomes. Patient-reported outcomes (PROs) form the basis on which value will be determined. 9 The National Institutes of Health’s effort to define PROs through its PROMIS initiative is just one example of how powerful entities in health care are emphasizing PROs to define value. 10
Multiple high-quality analyses on surgical interventions for LSS have employed PROs in their determination of effectiveness. 3,4,11-13 Unfortunately, a wide variety of PROs are used in spine surgery research, 7 limiting our ability to compare results between surgical techniques. As payer systems move toward value-based care, the need to categorize and understand PRO instruments used to evaluate the postoperative results of common interventions becomes ever more pressing. The purpose of this study is to identify and characterize the PROs utilized in evaluating the efficacy of fusion for LSS. We hypothesize that a wide variety of PROs are used, and that the majority of utilized PROs are too broad, and thus inadequate, to best assess results after lumbar fusion for spinal stenosis.
Materials and Methods
The PubMed and Embase computerized databases were systematically searched to identify all literature in the last decade (January 2007 through June 2017) reporting outcomes after spinal fusion for lumbar stenosis (Table 1). All articles were retrieved by an electronic search of Medical Subject Headings and keyword terms and their respective combinations (Figure 1). 14 -16 Inclusion criteria consisted of any study recording patient-reported clinical outcomes after lumbar fusion for spinal stenosis. Studies that compared fusion cohorts with those undergoing nonsurgical management, decompression alone, or dynamic stabilization were also included. Exclusion criteria included follow-up of less than 24 months; animal, biomechanics, cadaveric, and basic science studies; review articles; surgical technique guides; and case reports. Also excluded were studies for which English full text was not available.
Table 1. Database Search for Systematic Review.
Note: Search terms entered into PubMed and Embase search engines to identify English language studies from January 2007 to June 2017.

Figure 1. Flow diagram representing the systematic review process used in this study. A total of 123 studies were included for final analysis.
The literature search is outlined in Figure 1. The initial title search yielded a subset of possible articles that were then included or excluded according to the contents of each article’s abstract, again applying the inclusion and exclusion criteria. The full text of articles selected in both the title and abstract phases was then reviewed, and appropriate studies were selected for final inclusion. The title, abstract, and full-text selection process, with assessment of bias at the study level, was performed independently by 2 study authors (JPW and FL), with any discrepancies discussed and resolved by mutual agreement.
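The eligibility rules and the two-reviewer reconciliation described above can be expressed as a short filter. The following is only an illustrative sketch, not the procedure actually used by the reviewers: the `Record` fields, the study-type labels, and the helper names are hypothetical, and the date window and 24-month threshold are taken from the criteria stated in the text.

```python
from dataclasses import dataclass
from typing import Optional

# Study types excluded by the review's stated criteria.
# The label strings themselves are hypothetical.
EXCLUDED_TYPES = {
    "animal", "biomechanics", "cadaveric", "basic science",
    "review", "technique guide", "case report",
}


@dataclass
class Record:
    # Hypothetical schema; the review does not describe a data format.
    title: str
    study_type: str
    min_followup_months: int
    reports_pro: bool
    english_full_text: bool
    year: int
    month: int


def is_eligible(r: Record) -> bool:
    """Apply the inclusion and exclusion criteria to one record."""
    in_window = (2007, 1) <= (r.year, r.month) <= (2017, 6)
    return (
        in_window
        and r.reports_pro
        and r.english_full_text
        and r.min_followup_months >= 24
        and r.study_type not in EXCLUDED_TYPES
    )


def reconcile(reviewer_a: bool, reviewer_b: bool) -> Optional[bool]:
    """Return the shared decision, or None when the two reviewers
    disagree and the record must be discussed."""
    return reviewer_a if reviewer_a == reviewer_b else None
```

In this sketch a record on which the reviewers disagree is returned as `None` rather than being resolved automatically, mirroring the statement that discrepancies were settled by discussion.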
Several metrics were collected from each study to describe the patient population, including level of evidence per Oxford Center for Evidence-Based Medicine (OCEBM) criteria, 17 study design, number of patients at baseline and follow-up, mean patient age, gender distribution, inclusion of patients with spondylolisthesis (Meyerding grade I or II only; grades III and above were not reported by any study), follow-up time, and type of fusion surgery (as well as use of nonfusion controls). When fusion method was not specified, it was assumed to be posterolateral fusion given the prevalence of this method. 18 Moreover, a majority of assessed studies specifying fusion method cited use of this technique as opposed to interbody or other approaches. Proportions were calculated based only on studies that reported the given metric, as several studies lacked number of patients at follow-up, gender distribution, and age.
Patient-Reported Outcomes
Outcomes of interest encompassed any validated PROs recorded in the included studies. PROs were further classified based on multiple domains: pain/symptomatology, function/disability, and surgical satisfaction (Figure 2). Several PROs were questionnaires that captured outcomes in 2 or all domains. Documentation of return to work and/or baseline activity was recorded as a marker of function/disability.
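As an illustration of this classification, the sketch below maps a few instruments to domains and counts measures per domain. The mapping is deliberately partial and is not a reproduction of Figure 2: it contains only the five measures whose domain assignment is stated explicitly elsewhere in this article, and the function and variable names are hypothetical.

```python
from collections import Counter

PAIN = "pain/symptomatology"
FUNCTION = "function/disability"
SATISFACTION = "surgical satisfaction"

# Partial, illustrative mapping (an assumption for this sketch).
# Only measures whose domains are stated in the article text are listed.
PRO_DOMAINS = {
    "VAS": {PAIN},
    "JOA": {FUNCTION},
    "ZCQ/SSSM": {PAIN, FUNCTION, SATISFACTION},
    "SRS-22": {PAIN, FUNCTION, SATISFACTION},
    "Likert-type satisfaction scale": {SATISFACTION},
}


def measures_per_domain(mapping):
    """Count how many instruments assess each domain.
    An instrument covering several domains is counted once in each."""
    counts = Counter()
    for domains in mapping.values():
        counts.update(domains)
    return counts


def all_domain_measures(mapping):
    """Return the instruments that cover all three domains."""
    return sorted(name for name, d in mapping.items() if len(d) == 3)
```

Because one instrument can fall into several domains, the per-domain counts can sum to more than the number of distinct instruments, which is why the domain totals reported in this review exceed 31.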

Figure 2. All identified patient-reported outcome measures (n = 31), by domain, reported in the studies included for review (n = 123).
Results
Study Characteristics
Study characteristics are reported in Table 2. This systematic review included 123 published works: 4 Level I studies (3%), 25 Level II studies (20%), 48 Level III studies (39%), and 46 Level IV studies (37%) per OCEBM criteria. 17 The mean number of patients in each study was 373.9 (median, 94; range, 17-8142) at baseline and 311.7 (median, 91; range, 17-5390) at most recent postoperative follow-up. Minimum follow-up time averaged 37.1 months (range, 24-120). Several studies reported multiple methods of fusion—76% of studies assessed PROs following posterior or posterolateral fusion, while 32% assessed PROs following posterior or transforaminal interbody fusion. Only 5% assessed PROs following anterior or lateral interbody fusion. Overall, 31 distinct PROs were used among the 123 studies (Figure 2).
Table 2. Study Characteristics.
Abbreviations: PLF, posterior/posterolateral lumbar fusion; PLIF, posterior lumbar interbody fusion; TLIF, transforaminal lumbar interbody fusion; ALIF, anterior lumbar interbody fusion; LLIF, lateral lumbar interbody fusion.
Note: Includes demographics of the patients included in the studies reviewed as well as the overall characteristics of the studies included in the review.
PRO Measures
Figure 3 portrays the distribution of PRO measure utilization among studies. The mean number of reported measures was 2.76 (range, 1-8). Twenty percent of studies reported a single PRO measure and 30% reported 2 measures. The remaining 50% of studies presented 3 or more PRO measures for assessment.

Figure 3. Number of PRO measures reported per study.
The 10 most frequently cited PRO measures among the 123 reviewed studies are depicted in Figure 4. The Oswestry Disability Index (ODI), Visual Analog Scale (VAS), and 36-Item Short Form Survey (SF-36) were utilized most often—73%, 55%, and 32% of studies, respectively. No other measure was used in more than 14% of the analyzed studies. VAS was the most prominent measure reporting pain/symptomatology alone (back pain, leg pain, or both), whereas the Japanese Orthopedic Association Score (JOA) was the most prominent measure reporting solely function/disability (12%). Similarly, surgical satisfaction was usually assessed via categorical approach using 2-point, 5-point, or other Likert-type scales (13%).
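Utilization percentages of this kind, along with the mean number of measures per study, can be derived from a per-study list of instruments. The sketch below shows one way to do so; the function names and the four-study example are hypothetical and do not reproduce the review's data.

```python
from collections import Counter


def utilization(studies):
    """Percentage of studies reporting each PRO, most used first.
    `studies` is a list with one collection of PRO names per study;
    a PRO listed twice within a study is counted once."""
    n = len(studies)
    counts = Counter(pro for pros in studies for pro in set(pros))
    return {pro: 100 * c / n for pro, c in counts.most_common()}


def mean_measures_per_study(studies):
    """Average number of distinct PROs reported per study."""
    return sum(len(set(pros)) for pros in studies) / len(studies)


# Hypothetical toy input, not the review's data.
example = [
    {"ODI", "VAS"},
    {"ODI"},
    {"ODI", "SF-36", "VAS"},
    {"JOA"},
]
rates = utilization(example)
# ODI appears in 3 of 4 toy studies (75%), VAS in 2 of 4 (50%).
```

Counting each instrument once per study keeps the denominator at the number of studies, which matches how the percentages in this section are expressed.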

Figure 4. Top 10 most frequently reported PRO measures (of 123 studies).
Temporal Trends
Overall, the number of publications per year citing PRO measures as a means of evaluating outcomes following fusion for lumbar stenosis increased from 10.2 (2007 to 2011) to approximately 13.1 (2012 to mid-2017). Table 3 stratifies studies by the aforementioned time intervals and ranks measures by frequency of use. Furthermore, the mean number of PRO measures reported per study was significantly greater in studies published between 2012 and 2017 (3.08) compared to those published between 2007 and 2011 (2.31) (P = .003).
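The figures in this paragraph can be cross-checked against the overall mean of 2.76 measures per study reported earlier. The sketch below is a consistency check only. The per-period study counts (51 and 72) are an assumption: they are back-calculated here from the stated publication rates and the total of 123 studies, and are not quoted from the article. The period lengths (5 and 5.5 years) are likewise inferred from the stated date ranges.

```python
# ASSUMED study counts per period, back-calculated from the reported
# rates and the total of 123 studies; not stated in the text.
N_EARLY, N_LATE = 51, 72

# Period lengths in years, inferred from the stated date ranges
# (2007 to 2011, and 2012 to mid-2017).
YEARS_EARLY, YEARS_LATE = 5.0, 5.5

# Mean PRO measures per study in each period, as reported in the text.
MEAN_EARLY, MEAN_LATE = 2.31, 3.08


def publications_per_year(n_studies, years):
    """Average number of included studies published per year."""
    return n_studies / years


def pooled_mean(n1, mean1, n2, mean2):
    """Study-count-weighted mean across the two periods."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)


rate_early = publications_per_year(N_EARLY, YEARS_EARLY)
rate_late = publications_per_year(N_LATE, YEARS_LATE)
overall = pooled_mean(N_EARLY, MEAN_EARLY, N_LATE, MEAN_LATE)
```

Under these assumed counts, the two rates round to 10.2 and 13.1 publications per year, and the pooled mean rounds to 2.76, in agreement with the values reported in the text.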
Table 3. PRO Utilization Stratified Over Time.
Abbreviations: ODI, Oswestry Disability Index; VAS, Visual Analog Scale; SF-36, 36-Item Short Form Survey; JOA, Japanese Orthopedic Association Score; RM, Roland Morris Disability Questionnaire Score; EQ5D, European Quality of Life-5 Dimensions; Zung SDS, Zung Self-rating Depression Scale; ZCQ/SSSM, Zurich Claudication Questionnaire/Swedish Spinal Stenosis Measure.
Note: Reflects instances of reporting as a discrete, freestanding measure.
Discussion
Understanding the value of a surgical intervention depends on the outcome measures used. While radiographic and objective outcomes play a role in determining value, trends in health care payment models suggest that PROs will be increasingly important. Surgical interventions for LSS are already highly criticized in the medical community given conflicting data on outcomes, 19 so the ability to evaluate these interventions is increasingly important and dependent on standardized and validated PROs. On the subject of said clinical outcomes, we recognize that the predominant limitation to this review is based on potential for publication bias across and selective reporting bias within individual studies. That said, we expected each to be at least partially mitigated by the fact that we were assessing for variability in reporting methods and measures, rather than the clinical outcomes themselves.
We hypothesized that there would be a wide variety of PROs used, and that many of the PROs would be too general to account for outcomes after lumbar spine fusion. Our review of 123 recently published long-term studies on outcomes after fusion for LSS confirms the first part of our hypothesis. Even though we purposefully narrowed our search to a specific type of surgical intervention (fusion) for a specific diagnosis (LSS), and only included studies with at least 2 years of follow-up, there was an overwhelming number of PROs used. In addition, the number of PROs reported per study has increased over time. With regard to the second part of our hypothesis, the most popular PROs were not specific to outcomes after lumbar spine surgery.
Prior work has demonstrated the wide variability of PROs in spinal surgery. Guzman et al reviewed the frequency, trends, and methods of utilization of various spine-related PROs from 2004 to 2013, and also came to the conclusion that an extensive variety of PROs are utilized in spine surgery. Unlike our review, they chose articles on any form of spine surgery (lumbar, thoracic, cervical) for any diagnosis, and included articles from only 5 orthopedic journals (neurosurgical journals were excluded), identifying 206 unique PROs in 1079 spine surgery articles. 7 Despite the use of a much broader search, the most commonly used PROs were the same as those identified in our study: the VAS, ODI, and SF-36. Yadla et al performed a systematic review of outcomes studies on adult spinal deformity (ASD), finding that research on ASD is similarly plagued by a wide variety of PROs without standardization of outcomes instruments. 20 Finally, Ueda et al assessed PROs in degenerative cervical spine surgery and reported comparable heterogeneity, with 53 total outcomes reported. 21 The authors of these studies conclude that there is a need for greater standardization and guidelines to specify which instruments should be used for a given spinal disease or treatment. While it is less surprising that heterogeneous outcomes are reported across various spinal pathologies in the study by Guzman et al, our finding of similar heterogeneity in a focused search in LSS confirms these concerns and highlights the need for standardization of outcomes.
Similarly to the findings in Guzman et al, 7 the 3 most popular PROs used to evaluate lumbar spine fusion for spinal stenosis were the ODI, VAS pain scale, and SF-36, utilized in 73%, 55%, and 32% of studies, respectively. The most popular of the PROs, the Oswestry Disability Index, was first published and validated in 1980 by Fairbank et al as a measurement of disability in patients with chronic back pain. 22 The ODI is available in multiple languages and has undergone only one minor update since its inception; this availability and stability likely explain its wide use. The most detailed evaluation of the tool’s face-content (ie, the instrument’s effectiveness in measuring outcomes as intended or claimed) was conducted in patients with chronic back pain 23 and was found to be nonspecific to spine pathology. Multiple trials have demonstrated that leg pain, rather than chronic back pain, is more predictably relieved by surgery for LSS. Thus, the common use of a PRO designed to measure disability secondary to chronic back pain may very well underestimate the effect of surgery for LSS and therefore may not be the most appropriate measure.
The SF-36 originated from the efforts of the Medical Outcomes Study (MOS), a longitudinal study conducted in a general population of adult patients with medical conditions to determine the effect of various system-related factors that influence care. 24 The survey relies on 8 concepts of health: physical functioning, bodily pain, role limitations due to physical health problems, role limitations due to personal or emotional problems, general mental health, social functioning, energy/fatigue or vitality, and general health perceptions. The questionnaire has been used across all types of pathology and captures a broad range of domains, and its wide applicability suggests that the SF-36 is a validated PRO to measure a patient’s overall quality of life. It is reasonable to suggest that surgical intervention for spinal stenosis may well affect some of the domains more so than others, or in more nebulous ways. In particular, general mental health, vitality, and role limitations due to personal or emotional problems may contribute to both surgical eligibility and postoperative outcome, not to mention the patient perception of such.
The second most commonly utilized PRO in our review, the Visual Analog Scale for pain, was developed in 1972 as a measurement of pain intensity. 25 Popular for its simplicity and ease of use, it is almost ubiquitous as a measure of a patient’s pain level, but recent studies cast wide doubt on whether this tool can capture the complexity and idiosyncratic nature of pain experience. 26,27 For example, while 2 patients may experience a moderate level of pain intensity, due to personal circumstances/daily requirements one patient may be extremely bothered by this level of pain while another patient may be minimally affected. Other measures of pain, such as the Sciatica Bothersome Index (SBI), which rates the “bothersomeness” of paresthesias, weakness, and leg or back pain, may be more specific to symptoms of LSS and more accurately measure the effect of this condition and of subsequent intervention. 28,29 This outcome instrument has been used by well-known trials on interventions for lumbar spine surgery, including the Maine Lumbar Spine Study and the Spine Patient Outcomes Research Trial. 3,4 However, our review found that out of all included studies, this PRO was utilized by only 4 articles published between 2007 and 2011, and even fewer published between 2012 and 2017.
Similarly to the rare utilization of the SBI, our review found few investigations that utilized PROs developed specifically to measure symptoms associated with LSS. The Zurich Claudication Questionnaire/Swedish Spinal Stenosis Measure (ZCQ/SSSM) was developed at 3 academic hospitals in 1996 as a patient questionnaire used to measure the effect of decompressive surgery on patients with LSS. 30 Since that time it has been validated in multiple languages. 31,32 The questions selected for inclusion were chosen based on consensus opinion of an expert panel (rheumatologists, orthopedic surgeon, behavioral scientist) combined with a literature review. Each question corresponds to symptom severity, physical function, or satisfaction with the results of the back operation. The ZCQ/SSSM and the SRS-22 (developed for use in spine deformity surgery) were the only 2 PROs that encompassed the 3 PRO classifications we defined in our search: pain/symptomatology, function/disability, and surgical satisfaction. The most widely known application of the ZCQ/SSSM was in trials to measure the effectiveness of the X-STOP (Medtronic, Switzerland) dynamic stabilization device. 33,34 Despite its more focused nature, in our review the ZCQ/SSSM was only used in 7 total studies.
In our increasingly cost-conscious environment, we must select treatments and interventions that deliver the greatest value for a given cost. On a disease-specific level, this requires standardized outcome instruments validated for a specific pathology to enable accurate comparison across studies. The current LSS literature, with heterogeneous reporting and suboptimal PRO use (eg, ODI), limits our ability to draw such comparisons. We must reach consensus on outcomes reporting in LSS, whether it be an existing disease-specific measure, such as the ZCQ/SSSM, or an entirely new measure developed by an international panel of experts (anecdotally, we found that researchers in different regions of the world were especially prone to using varying methods, despite meeting in the same congress). Though it is difficult to truly demonstrate the superiority of one measure versus another without additional dedicated randomized trials, previously cited traits of a unifying measure would include disease-specific outcomes reporting, measures of general health, pain, satisfaction, and employment. 7,21,35 We agree that reporting a general health measure should be standard practice in studies of LSS given that payers are increasingly forced to restrict payments to the most cost-effective interventions both within and across pathologies. The SF-36 is the most commonly reported general instrument seen in our study but has limitations, including significant floor/ceiling effects in the cervical population. 36 An alternative instrument is the National Institutes of Health PROMIS, which provides consistent, reliable, and responsive PROs that are generalizable across a variety of diseases and have been validated across a variety of orthopedic conditions. 37-41 PROMIS assesses physical, mental, and social health and uses item response theory, which is more responsive to changes than traditional testing, as well as computer adaptive testing, which decreases burden on the patient and allows for improved integration into electronic medical record systems. PROMIS has the potential to be an efficient, reliable, and responsive instrument to measure general health and could be adopted in the LSS literature as well as in other spinal literature.
Conclusions
Here we demonstrate that the LSS literature is characterized by substantial variability in outcomes reporting, with 31 total PROs and 90% of instruments reported in fewer than 15% of articles. Regarding impact on employment, only 14% of studies reported on return to work. Efforts to standardize outcomes reporting, including both a disease-specific and general health outcome measure, would facilitate comparison across the literature and improve our understanding of the prognosis of this disease. As the cost of care becomes increasingly scrutinized, such standardization will enable value comparisons across disciplines.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
