Abstract
Study Design:
Diagnostic study, level of evidence III.
Objective:
Pyogenic spondylodiscitis can cause deformity, neurological compromise, disability, and death. Recently, a new classification of spondylodiscitis based on magnetic resonance imaging was published. The objective of this study is to perform an independent reliability analysis of this new classification.
Methods:
We selected 35 cases from our database of different spine centers in Latin America and from the literature; 8 observers graded the scenarios according to the classification developed by Pola et al. The cases were sent to the observers again, in a random sequence, after 3 weeks to assess intraobserver reliability. Interobserver and intraobserver reliabilities were assessed with Fleiss and Cohen statistics, respectively.
Results:
The overall Fleiss κ value for interobserver agreement on the main types of the classification was substantial, with 0.67 (95% CI = 0.43-0.91) in the first reading and 0.67 (95% CI = 0.45-0.89) in the second reading. The Cohen κ value for intraobserver agreement was also substantial, with 0.68 (95% CI = 0.45-0.92). The interobserver agreement analysis for the subtypes of this classification was overall substantial, with 0.60 (95% CI = 0.37-0.83) in the first reading and 0.61 (95% CI = 0.41-0.81) in the second reading. The overall intraobserver agreement for subtypes of the classification was also substantial, with 0.63 (95% CI = 0.34-0.93).
Conclusion:
The new classification developed by Pola et al showed substantial interobserver and intraobserver agreements. More studies are required to validate the usefulness of this classification especially in clinical practice.
Introduction
Pyogenic spondylodiscitis (PS) is an infectious disease that involves the vertebral endplates and can extend into the disc space. PS has an estimated prevalence of 6.5 per 100 000 in western societies and is associated with increased morbidity, hospital length of stay, and mortality. 1 This condition has been shown to be more prevalent among elderly patients with chronic debilitating conditions, those with immunodeficiency, and intravenous drug users. 2 The most common location is the lumbar spine, followed by the thoracic and cervical regions, 3 and the most frequent agents are staphylococcal and streptococcal species. 4
The goals of treatment are to relieve pain, avoid neurological deterioration, eradicate infection, provide spinal stability, and prevent deformity. 5 -7 Orthopedic guidelines with proper algorithms for the management of PS are not universally accepted and usually rely on clinical studies with variable inclusion criteria. 8 -12 Ideally, a classification system should be easy to apply, inclusive, reproducible, 13,14 and if possible, helpful in guiding a treatment option with recommendations.
Previous attempts have been made to create a proper classification system with treatment guidelines 15,16 ; however, to date, no classification system has been universally accepted. In 2017, Pola et al 17 developed a new classification of PS based on contrast-enhanced magnetic resonance imaging (MRI) findings to define a treatment algorithm. The objective of this study is to perform a reliability study of the new classification developed by Pola et al 17 through interobserver and intraobserver analyses to assess the concordance of this classification among readers.
Material and Methods
After approval from our institutional review board (Protocol Number IRB 0 003 937), we conducted a multicenter study to validate the classification through an independent analysis. We invited 8 young spine surgeons (observers) to evaluate 35 contrast-enhanced MRI scenarios of PS, each corresponding to a different spondylodiscitis case, according to the classification described by Pola et al 17 (see Table 1). In their study, Pola et al 17 described the clinical-radiological classification of spondylodiscitis in 250 patients with treatment recommendations and at least a 2-year follow-up. The classification is based on MRI according to major criteria, such as the presence of instability, epidural abscess, and neurological compromise, and minor criteria, such as paravertebral soft-tissue involvement or intramuscular abscess. The authors also proposed a treatment algorithm based on the main types and subtypes (Table 2).
Table 1. Classification of Spondylodiscitis According to Pola et al. 17
Table 2. Treatment Algorithms According to the Classification.
The clinical scenarios were gathered from a database of 6 centers in Latin America by the authors of the study. The invited observers were spine fellows from diverse centers in Latin America who did not belong to the designer team and were not familiar with the included cases. The designer team (authors) received 97 cases of spondylodiscitis; each case arrived by email as a case presentation with MRI and clinical information (presence of neurological compromise based on physical examination). All cases were classified individually by the authors according to the classification developed by Pola et al. 17 This classification consists of 3 main types (A, B, and C) based on the presence of primary criteria on MRI: bone destruction or segmental instability, epidural abscess, and neurological impairment. Secondary criteria help define the subtypes of the classification and are as follows: soft-tissue involvement and paravertebral muscular abscesses. Cases reaching a concordance of 100% among the authors were subselected, and 35 of these were finally included in the analysis. The classification was sent by email to the observers; each observer was trained beforehand to apply the classification, and questions were answered before the final assessment. All 35 clinical cases (12 type A, 11 type B, and 12 type C cases; Table 3) were sent at once to the observers by email, and they had access to the classification scheme while grading (Table 1). All cases were sent back to the designer team with their respective classification types. After 3 weeks, all participating observers received the same 35 cases again, but in a different random order, to classify and send results back to the authors for intraobserver reliability analysis.
Table 3. Distribution of Pyogenic Spondylodiscitis Cases According to the Main Types.
All data were collected and analyzed for reliability. Interobserver and intraobserver agreements were assessed in 2 different ways: for the main type of classification (types A, B, and C) and for each subtype regarding complication (A1, A2, B2, etc).
Treatment Management
Our 35 cases were treated in line with the algorithm by Pola et al 17 (Table 2). Of the 12 type A cases, 10 received conservative treatment with orthosis and 2 required stabilization for persistent back pain. Of the 11 type B cases, 4 were treated conservatively and 7 required stabilization. Of the 12 type C cases, 8 required open debridement and decompression, and 4 required stabilization after decompression.
Statistical Analysis
Evaluation of Interobserver Agreement
In a first step, we evaluated the interobserver agreement through the calculation of the unweighted κ coefficient for the main type classification. The classification for main categories (types) for spondylodiscitis (Figure 1; types A, B, and C) was prepared as an ordinal variable represented by numerical values from 1 to 3 (type A: 1; type B: 2; and type C: 3); the authors assigned 1 point to perfect agreement between each pair of readers when both readers assigned the same type, and perfect disagreement was represented as a value of 0 when different types were assigned by 2 readers.
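As an illustration of the chance-adjusted statistic used for the main types, the following minimal Python sketch computes an unweighted Fleiss κ from a table of ratings (one row per case, one label per observer). It is not the code used in this study; the function name `fleiss_kappa` and the toy ratings are assumptions made only for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Unweighted Fleiss kappa.

    ratings: list of cases, each a list with one label per rater
             (every case must have the same number of raters).
    categories: list of all possible labels, e.g. ["A", "B", "C"].
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][c] = number of raters who assigned category c to case i
    counts = [Counter(case) for case in ratings]

    # Observed agreement: proportion of agreeing rater pairs per case, averaged
    per_case = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_case) / n_cases

    # Expected agreement by chance, from the overall category proportions
    proportions = [
        sum(c[cat] for c in counts) / (n_cases * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data: 4 cases, 4 observers, main types A/B/C
toy_ratings = [
    ["A", "A", "A", "B"],
    ["B", "B", "C", "B"],
    ["C", "C", "C", "C"],
    ["A", "B", "B", "B"],
]
print(round(fleiss_kappa(toy_ratings, ["A", "B", "C"]), 2))
```

With the study design described here, `ratings` would hold 35 rows of 8 labels each; confidence intervals are not covered by this sketch.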

Figure 1. Cases of spondylodiscitis: (A, B) Type A spondylodiscitis. (C, D) Type B spondylodiscitis. (E, F) Type C spondylodiscitis.
In a second step, we evaluated the interobserver agreement through the calculation of the weighted κ coefficient for subtypes of classification for each pair of judges (readers). Regarding the classification categories (subtypes) for spondylodiscitis, an ordinal variable represented by numerical values from 1 to 11 was prepared (subtype A1: 1; subtype A2: 2; subtype A3: 3; subtype A4: 4; subtype B1: 5; subtype B2: 6; subtype B3: 7; subtype C1: 8; subtype C2: 9; subtype C3: 10; subtype C4: 11).
The authors assigned 1 point to perfect agreement between each pair of readers, defined as the situation in which both readers assigned exactly the same subtype of the spondylodiscitis classification to the clinical scenario in question. A disagreement of more than 1 category between the 2 readers was represented by a value of 0. When the disagreement was of only 1 category (eg, the same case was evaluated as corresponding to grade 2 of the classification by one reader and to grade 1 by the other), the authors agreed on a partial score of 0.5 or 0.75 points according to the case. Close disagreements of low clinical relevance received a minimal penalty and were scored 0.75; those of greater clinical relevance received an intermediate penalty and were scored 0.5.
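The weighting scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `build_weights` and `weighted_kappa` are hypothetical, and because the text does not list which adjacent subtype pairs were scored 0.75 rather than 0.5, the `adjacent` mapping in the example is invented purely to show the mechanics.

```python
def build_weights(n_categories, adjacent=None, default_adjacent=0.5):
    """Agreement-weight matrix: 1 on the diagonal, a partial score for
    categories that differ by exactly 1, and 0 everywhere else.

    adjacent: optional dict {(i, i + 1): weight} using 0-based indices,
              for pairs that should not get the default partial score.
    """
    adjacent = adjacent or {}
    w = [[0.0] * n_categories for _ in range(n_categories)]
    for i in range(n_categories):
        w[i][i] = 1.0
        if i + 1 < n_categories:
            value = adjacent.get((i, i + 1), default_adjacent)
            w[i][i + 1] = w[i + 1][i] = value
    return w


def weighted_kappa(reader_1, reader_2, weights):
    """Weighted Cohen kappa for two readers.

    reader_1, reader_2: ordinal codes starting at 1 (e.g. A1 = 1 ... C4 = 11).
    weights: square agreement-weight matrix such as build_weights() returns.
    """
    n = len(reader_1)
    k = len(weights)

    # Joint proportion of cases for each (reader_1, reader_2) code pair
    joint = [[0.0] * k for _ in range(k)]
    for a, b in zip(reader_1, reader_2):
        joint[a - 1][b - 1] += 1 / n

    row = [sum(joint[i]) for i in range(k)]
    col = [sum(joint[i][j] for i in range(k)) for j in range(k)]

    p_observed = sum(weights[i][j] * joint[i][j] for i in range(k) for j in range(k))
    p_expected = sum(weights[i][j] * row[i] * col[j] for i in range(k) for j in range(k))
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical example: 11 subtypes, with the A1/A2 pair (indices 0 and 1)
# treated as a low-relevance disagreement scored 0.75 (an assumption).
weights = build_weights(11, adjacent={(0, 1): 0.75})
first_reading = [1, 2, 5, 8, 9, 11, 4, 6]
second_reading = [2, 2, 5, 8, 10, 11, 4, 7]
print(round(weighted_kappa(first_reading, second_reading, weights), 2))
```

The same function serves for a pair of different readers (interobserver) or for the 2 readings of a single reader (intraobserver), since both analyses use the same weighting matrix.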
Evaluation of the Degree of Intraobserver Agreement
To assess the degree of intraobserver agreement (test-retest), a weighted κ coefficient was calculated according to the same weighting matrix as for the degree of agreement among the different observers. We determined the sample size to provide adequate variability to assess discrimination among the main types of spondylodiscitis and acceptably precise reliability estimates. Based on a simulation process, with a sample of 35 participants, each assessed by 8 raters on a 3-category classification system, there would be a greater than 95% chance of rejecting the null hypothesis that the Fleiss κ is less than 0.7 if the true Fleiss κ is 0.9. Chance-adjusted Fleiss and Cohen statistics with 95% CIs were used to determine interobserver and intraobserver reliabilities, respectively. 18,19
The level of agreement (κ) was determined as proposed by Landis and Koch 20 with κ values of 0.00 to 0.20 considered slight agreement, 0.21 to 0.40 considered fair agreement, 0.41 to 0.60 considered moderate agreement, 0.61 to 0.80 considered substantial agreement, and 0.81 to 1.00 considered almost perfect agreement.
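For readers who want to apply these bands programmatically, a minimal Python sketch is given below. The function name `interpret_kappa` is hypothetical; the bands follow the ranges listed above, and the label for negative values ("poor") comes from Landis and Koch's original scale rather than from the ranges quoted in this section.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the Landis and Koch agreement label."""
    if kappa < 0:
        # Band from Landis and Koch's scale, not listed in the text above
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Values reported in this study for the main types
print(interpret_kappa(0.67))  # overall interobserver agreement
print(interpret_kappa(0.53))  # type B, one of the two readings
```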
Results
Reliability for the Main Types of Spondylodiscitis
The overall interobserver agreement was 0.67 (95% CI = 0.43-0.91) for the first reading and 0.67 (95% CI = 0.45-0.89) for the second reading. Agreement analysis based on each type is described in Table 4. Intraobserver agreement was 0.68 (95% CI = 0.45-0.92).
Table 4. Interobserver Agreement of Main Types of Spondylodiscitis.
Reliability Analysis of the Subtypes of Spondylodiscitis
Assessment of reliability among observers for the different subtypes of spondylodiscitis showed an interobserver agreement of 0.60 (95% CI = 0.37-0.83) in the first reading and 0.61 (95% CI = 0.41-0.81) in the second reading. A substantial intraobserver agreement of 0.63 (95% CI = 0.34-0.93) was found.
Discussion
Our study showed substantial interobserver (κ = 0.67) and intraobserver (κ = 0.68) agreement for the classification of Pola et al, 17 with moderate agreement (κ = 0.53 and 0.49) when classifying type B. The lower agreement for type B is probably explained by the fact that this classification uses spinal instability (defined as more than 25% in segmental kyphosis at the level compromised) as the primary criterion to differentiate between types A and B. This is a limitation of the classification and may lead to misclassification, because MRI is performed in the supine position and spinal instability is therefore not always visible on it. Instability criteria are better assessed on a standing radiograph than on MRI; however, there are cases where instability secondary to bone destruction is evident even on MRI. Furthermore, many patients are unable to maintain a standing position for an X-ray, and we agree with the authors that MRI is the best modality to classify spondylodiscitis.
Another limitation of this classification is that it relies on an extensive number of subtypes, especially considering that different subtypes such as A2, A3, and A4 require the same treatment according to the authors’ recommendation, and the same applies to subtypes B1 and B2. A simpler classification would probably be easier to understand and still useful for the treatment recommendations. 17 A classification system should be reproducible and useful if it is to gain widespread acceptance in clinical practice and research; Pola et al 17 conducted no agreement analysis in their study, and to our knowledge, this is the first independent analysis of this classification.
Our study has strengths, such as the number of observers, which makes the results more reliable. In addition, the observers were from different institutions in Latin America and were all actively involved in spine surgery, so the analysis was not limited to a single center. Another strength is that this reliability study was conducted by spine surgeons from a region other than that of Pola et al, 17 which decreases conflicts of interest.
It is important to state the limitations of our study. First, the designer team selected cases from a database and decided which cases were more suitable to be analyzed and compared; this could represent a selection bias. On the other hand, 100% concordance among the authors was required to include a case in the analysis, making the selection process more reliable. Another limitation is the level of training of the observers, who were young spine surgeons rather than experienced attending surgeons. This could affect the interobserver analyses; ideally, experienced, trained spine surgeons could yield more reliable results. However, reliability studies usually include fellowship-trained physicians, who tend to participate actively and to be available for these studies, and many interobserver analyses showed no difference between residents and attending surgeons in terms of agreement. 13, 21, 22 Another limitation of this study is the lack of an interobserver agreement estimate for each subtype of the classification, a consequence of the number of cases required to make this specific measurement; instead, we evaluated the overall reliability of subtypes among readers, which was substantial (0.61). Knowing the reliability of classifying each subtype would be ideal for understanding which subtypes of the classification are most difficult to identify; on the other hand, the number of cases required for this estimation is high (more than 100 cases to be assessed by each observer), and such an extensive analysis could itself affect the judgment of the observers. Despite these limitations, this is the first independent interobserver and intraobserver agreement analysis to assess the reliability of the new classification developed by Pola et al. 17
Conclusion
The new classification of spondylodiscitis proposed by Pola et al, 17 even though it is extensive regarding the subtypes, has shown substantial interobserver and intraobserver agreements and could be an important tool when classifying PS. More studies are required to evaluate the usefulness of this classification, especially in clinical practice.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
