Abstract
Study design
Retrospective, mono-centric cohort research study.
Objectives
The purpose of this study is to validate a novel artificial intelligence (AI)-based algorithm against human-generated ground truth for radiographic parameters of adolescent idiopathic scoliosis (AIS).
Methods
An AI-algorithm was developed that is capable of detecting anatomical structures of interest (clavicles, cervical, thoracic, lumbar spine and sacrum) and calculating essential radiographic parameters in AP spine X-rays fully automatically. The evaluated parameters included T1-tilt, clavicle angle (CA), coronal balance (CB), lumbar modifier, and Cobb angles in the proximal thoracic (C-PT), thoracic, and thoracolumbar regions. Measurements from 2 experienced physicians on 100 preoperative AP full spine X-rays of AIS patients were used as ground truth and to evaluate inter-rater and intra-rater reliability. The agreement between human raters and AI was compared by means of single measure Intra-class Correlation Coefficients (ICC; absolute agreement; >.75 rated as excellent), mean error and additional statistical metrics.
Results
The comparison between human raters resulted in excellent ICC values for intra- (range: .97-1) and inter-rater (.85-.99) reliability. The algorithm was able to determine all parameters in 100% of images with excellent ICC values (.78-.98). Consistent with the human raters, ICC values were typically smallest for C-PT (eg, rater 1A vs AI: .78, mean error: 4.7°) and largest for CB (.96, -.5 mm) as well as CA (.98, .2°).
Conclusions
The AI-algorithm shows excellent reliability and agreement with human raters for coronal parameters in preoperative full spine images. The reliability and speed offered by the AI-algorithm could contribute to the efficient analysis of large datasets (eg, registry studies) and measurements in clinical practice.
Introduction
Adolescent idiopathic scoliosis (AIS) is a complex two- or even three-dimensional spinal deformity. 1 In both conservative therapies using customized braces and surgical interventions, AIS patients undergo regular radiological examinations. Additional follow-up evaluations require the estimation of coronal curve characteristics as well as the progression of the coronal deformity. However, currently employed manual methods for the radiological image analysis are time-consuming and dependent on physician experience and therefore hinder the efficient and objective assessment of relevant coronal radiographic parameters in clinical routine and the analysis of large databases for research purposes. 2
Standing full-spine X-ray images and lateral bending images are the most common imaging modalities for the radiographic analysis of AIS. These images are used to assess the severity of the deformity, for which Lenke et al have developed a widely adopted classification system. 3 The Cobb angle is fundamental for the Lenke et al classification and thus the evaluation of AIS patients and their therapeutic decision-making process.4-6 Therefore, a reliable measurement of the Cobb angle is critical for clinical routine. Due to its significance in the characterization and treatment of AIS patients, the Cobb angle furthermore serves as a central parameter in studies analyzing inter-rater reliability in the radiographic assessment of coronal parameters. A recent study rated the inter-rater reliability (quantified by intra-class-correlation coefficients, ICCs) for the determination of the Cobb angle as excellent, with ICCs greater than .98 and an average variability of 3°. 7 These findings correspond to those in earlier studies demonstrating a correlation of .98 for repeated measurements with a standard deviation of the differences in primary Cobb angle of 2.5°-3.2°.8,9 However, excellent agreement did not apply to smaller curves (<20°) in the secondary Cobb angle, with a correlation of only .52. 9 In addition, only very few studies reported reliability on further essential coronal parameters (eg, shoulder balance). 10 Discrepancy in the assessment of manual measurements as well as the required time for the assessment of multiple radiographic parameters by physicians highlight the need for the objective, observer-independent, automated assessment of coronal radiographic parameters in clinical routine and research.
Automatic determination of radiographic parameters of the coronal balance could provide the efficiency and accuracy required in clinical assessment and treatment of AIS. Algorithms based on artificial intelligence (AI) represent an approach for the rapid and independent computation of essential radiographic parameters, which could improve measurement validity and streamline radiology workflows. 11 In 2013, Langensiepen et al conducted a review of the results of eleven promising studies using novel, semi-automatic, app-controlled, and automatic measurement approaches. 12 However, all of the methods required the manual placement of anatomical landmarks or identification of anatomical regions of interest. Although a few recent publications have presented the results of fully automated assessment procedures,13,14 they are limited to Cobb angles in the coronal plane and thus omit other essential parameters, such as coronal and shoulder balance or T1-tilt. Furthermore, they lack a comprehensive comparison of inter- and intra-rater reliability in larger patient cohorts with different physicians, which impedes the statistical interpretation of the findings because high-quality reference measurements by experienced raters are missing.
Therefore, the goal of the present research study is to develop and scientifically validate a fully automated AI-based algorithm able to determine several essential coronal parameters by comparing its predictions with assessments of 2 experienced physicians in 100 AIS patients.
Methods
This study hypothesized that a novel AI-based algorithm can determine essential coronal parameters fully automatically with excellent reliability (ICC >.75) compared to experienced physicians from 2 different scoliosis centers.
Study Design and Patient Selection
In this study, 100 prospectively collected preoperative AP whole spine X-rays of AIS patients from a database of a single scoliosis center were used for retrospective measurements of coronal parameters. The images were taken preoperatively in a neutral standing position and recorded between May 2019 and September 2021 by experienced technicians using the EOS® system (sterEOS imaging, Paris, France, version: 1.8.7.66 R; 2D postural assessment). At the time of surgery, the average age of the patients (80♀/20♂) was 14.6 years (range: 12-17 years) and the mean BMI was 20.4 kg/m² (13.8-34.6 kg/m²).
The de-identified radiographs in DICOM format were made available to 2 physicians (rater 1 and rater 2) experienced in the radiographic evaluation of AIS patients and performing measurements in daily routine. Using a standardized digital imaging tool (Surgimap® Version 2.3.2.1), both raters manually measured coronal parameters separately and independently from each other. Rater 1 conducted the measurements independently twice (rater 1A, B) to allow the assessment of intra-rater reliability in addition to inter-rater reliability between both raters. All measurements from both human raters were blinded to each other, as these values served as the ground truth to validate the algorithm’s predictions.
For this research study, no X-ray imaging beyond the clinical standard was performed. This study adhered to legal data protection regulations and to the 1964 Helsinki Declaration, its amendments, and other equivalent ethical standards. Patients signed written informed consent and approval was granted by the Ethics Committee of the doctors’ chamber of Schleswig Holstein (registry number 037/18 m).
Evaluated Coronal Parameters
The following coronal radiographic parameters were evaluated in accordance with scientific literature.15-19
• T1-tilt (°): angle between a line along the superior endplate of T1 and the horizontal line (positive sign: patient’s right side of the T1-vertebral body (VB) is located more cranially than the left side).
• Clavicle angle (CA, °): angle between the line passing through the most cranial points of both clavicles and the horizontal plane (positive sign: patient’s right shoulder is located more cranially than the left side).
• Coronal balance (CB, mm): distance between the central sacral vertical line (CSVL) and the C7 plumb line (C7PL; positive sign: the C7PL is located on the right side of the CSVL).
• Cobb angles (C, °): angles between the superior endplate of a cranial VB and the inferior endplate of a caudal VB. In accordance with the Lenke classification, 3 Cobb angles were classified as:
  o proximal thoracic (C-PT; apex between T2 and T6),
  o thoracic (C-T; apex between vertebra T6 and disc T11/T12),
  o thoracolumbar/lumbar (C-TL; apex between disc T12/L1 and L4).
• Lumbar modifier (LM): classification in grades A, B or C depending on the CSVL intersection with the lumbar apex 3:
  o ‘A’: CSVL runs between the pedicles,
  o ‘B’: CSVL touches a pedicle,
  o ‘C’: CSVL lies lateral to the lumbar apex.
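The angle and distance definitions above can be illustrated with a short Python sketch. It is not the authors' implementation: the function names, the use of (x, y) pixel landmarks, the assumption that y increases downwards in the image, and the `right_is_image_left` flag describing the display orientation are hypothetical choices made for illustration only.

```python
import math

def tilt_angle_deg(right_pt, left_pt):
    """Signed angle (degrees) between the line through two landmarks and the horizontal.

    Assumption: points are (x, y) pixels with y increasing downwards.
    `right_pt` lies on the patient's right side. The result is positive when
    the patient's right landmark is more cranial (smaller y), matching the
    sign convention stated for T1-tilt and clavicle angle.
    """
    dx = abs(left_pt[0] - right_pt[0])
    dy = left_pt[1] - right_pt[1]  # > 0 when the right landmark is higher
    return math.degrees(math.atan2(dy, dx))

def coronal_balance_mm(c7_x, csvl_x, mm_per_pixel, right_is_image_left=True):
    """Signed horizontal distance between C7 plumb line and CSVL in mm.

    Positive when the C7PL lies on the patient's right of the CSVL.
    `right_is_image_left` is an assumed flag for the display orientation.
    """
    offset_px = (csvl_x - c7_x) if right_is_image_left else (c7_x - csvl_x)
    return offset_px * mm_per_pixel

def cobb_angle_deg(upper_endplate, lower_endplate):
    """Unsigned angle between two endplate lines, each given as ((x1, y1), (x2, y2))."""
    def line_angle(p, q):
        return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0]))
    diff = abs(line_angle(*upper_endplate) - line_angle(*lower_endplate)) % 180.0
    return min(diff, 180.0 - diff)  # direction of each line does not matter
```

For example, a superior endplate tilted by about 5.7° and an inferior endplate tilted by about 11.3° in the opposite direction give a Cobb angle of roughly 17°.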
AI-Algorithm
A fully automated algorithm was developed, containing a deep learning convolutional neural network to first identify different anatomical structures in AP spine X-rays (“phase 1”) and subsequently compute parameters based on the network’s output (“phase 2”). As a result, given an AP full spine X-ray image as input, the algorithm computes all aforementioned coronal parameters fully automatically and visualizes them in a proving image as the output (Figure 1).
Figure 1. Schematic representation of the pipeline of the presented algorithm showing an AP full spine X-ray as input (left); segmentation of anatomical structures of interest (middle left); computation of parameters based on the segmentation results (middle right); visualization of the coronal parameters (right).
Phase 1: Segmentation
Preprocessing and Data Enhancement
The goal of the developed segmentation model was to localize the sacrum, both clavicles, and all VBs, and to classify each VB according to its region (cervical, thoracic, lumbar; Figure 1, phase 1). In a first preprocessing step, to ensure better visibility of bony structures, brightness and contrast were enhanced on all training, validation, and test images using the window width and center information from the respective DICOM tags and adaptive histogram equalization.
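As an illustration of this preprocessing step, the following Python sketch applies a simplified linear intensity window and a histogram equalization to one image tile. The study does not state the exact windowing formula or which adaptive equalization variant was used, so both functions and their names are assumptions; adaptive methods typically apply such an equalization per local region and blend the results, which is not shown here.

```python
def apply_window(pixels, center, width):
    """Map raw intensities to [0, 1] with a simplified linear window.

    `pixels` is a list of rows; `center` and `width` stand for the DICOM
    window center and window width values (assumed to be already read).
    """
    lo = center - width / 2.0
    hi = center + width / 2.0
    return [[min(1.0, max(0.0, (v - lo) / (hi - lo))) for v in row]
            for row in pixels]

def equalize_tile(values, levels=256):
    """Histogram-equalize a flat list of values in [0, 1] (one tile)."""
    bins = [min(levels - 1, int(v * levels)) for v in values]
    hist = [0] * levels
    for b in bins:
        hist[b] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    n = len(values)
    # Each value is replaced by its cumulative frequency within the tile.
    return [cdf[b] / n for b in bins]
```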
Both training and test datasets comprised anonymized preoperative and postoperative, full spine and full body AP images that were obtained from 3 different clinical sites. The majority of patients in both datasets suffered from spinal coronal deformities. Ground truth segmentation information for all visible anatomical entities for the training and test dataset was created by trained medical professionals using a web-based annotator. Annotations consisted of bounding polygon masks around each anatomical structure and the respective labels (cervical/thoracic/lumbar VB, sacrum, clavicle, first rib, femoral head). The final training dataset consisted of 271 full spine and 161 full body images.
Training
A Mask Region-Based Convolutional Neural Network (Mask R-CNN) was trained for 370 epochs with a constant learning rate of .002 using the PyTorch framework on an NVIDIA GeForce GTX 1080 GPU. 20 The training was initialized with pretrained weights from an instance segmentation model trained on the publicly available ImageNet model zoo 21 to improve the robustness of the model.22,23 During training, flipping augmentation was applied to a randomly selected half of the training images. Inference results of the model consisted of structure localization (segmentation mask, bounding box), assigned category and certainty score.
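The flipping augmentation can be sketched in a few lines of Python. This is an illustrative sketch only: the text does not say whether flipping was horizontal or vertical, so a horizontal flip is assumed here, and the helper names are hypothetical. The polygon annotation has to be mirrored together with the image so that masks stay aligned.

```python
import random

def flip_polygon(points, image_width):
    """Mirror polygon vertices (x, y) horizontally within an image of given width."""
    return [(image_width - 1 - x, y) for x, y in points]

def choose_images_to_flip(image_ids, rng):
    """Randomly pick half of the training images for flipping."""
    return set(rng.sample(image_ids, len(image_ids) // 2))
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible between training runs.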
Phase 2: Parameter Determination
In phase 2, predictions from the trained model were used to compute coronal parameters using calculus and geometrical interrelations. Initially, as the model only distinguishes between cervical, thoracic, and lumbar VBs, the algorithm labelled all VBs from bottom to top assuming 5 lumbar, 12 thoracic and 7 cervical VBs. Subsequently, a spline curve was fitted through the centers of mass of all VBs as shown in Figure 2.
Figure 2. (a) Midpoints of all VB segmentations; (b) spline fit through midpoints; (c) extrema (blue) and points of inflection (yellow) derived from the spline fit and considered end vertebrae according to the Lenke Cobb area classification (green).
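The bottom-to-top labelling step can be illustrated as follows. This is a minimal sketch under stated assumptions, not the published code: it assumes that the most caudal detected VB is L5, that the y coordinate of each centroid increases downwards in the image, and the function name is hypothetical.

```python
def label_vertebrae(centroids_y):
    """Assign L5..L1, T12..T1, C7..C1 to vertebrae ordered from bottom to top.

    `centroids_y` holds the vertical centroid position of each detected VB
    (larger y means more caudal). Returns one label per input entry; entries
    beyond 24 vertebrae stay None.
    """
    names = ([f"L{i}" for i in range(5, 0, -1)]
             + [f"T{i}" for i in range(12, 0, -1)]
             + [f"C{i}" for i in range(7, 0, -1)])
    # Indices of the detections sorted from most caudal to most cranial.
    order = sorted(range(len(centroids_y)),
                   key=lambda i: centroids_y[i], reverse=True)
    labels = [None] * len(centroids_y)
    for name, idx in zip(names, order):
        labels[idx] = name
    return labels
```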
For Cobb angles, apexes and end vertebrae were identified by computing extrema and points of inflection on the spline, respectively. If fewer than 4 inflection points were detected, additional end vertebra locations were defined using Lenke’s classification for Cobb angle areas 3 (Figure 2(c)).
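The search for extrema and points of inflection can be sketched with finite differences on a sampled curve. This is a stand-in for evaluating the derivatives of the fitted spline, which the study does not describe in detail; the function names and the representation of the curve as lateral offsets `x` sampled from one end of the spine to the other are assumptions.

```python
def sign_flip_indices(values):
    """Indices at which consecutive non-zero values change sign."""
    flips, prev_sign = [], 0
    for i, v in enumerate(values):
        sign = (v > 0) - (v < 0)
        if sign == 0:
            continue  # skip exact zeros and compare with the next non-zero value
        if prev_sign and sign != prev_sign:
            flips.append(i)
        prev_sign = sign
    return flips

def apexes_and_inflections(x):
    """Locate apex and inflection samples on a curve of lateral offsets.

    Apexes are local extrema (sign change of the first difference);
    inflection points are sign changes of the second difference.
    Returns two lists of sample indices into `x`.
    """
    d1 = [x[i + 1] - x[i] for i in range(len(x) - 1)]
    d2 = [d1[i + 1] - d1[i] for i in range(len(d1) - 1)]
    return sign_flip_indices(d1), sign_flip_indices(d2)
```

On an S-shaped curve, this returns one apex per curve and one inflection point between them, which corresponds to the shared end vertebra of two adjacent curves.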
Statistical Analysis
To ensure comparability to similar studies,13,24,25 the degree of absolute agreement between raters was determined using two-way mixed (intra-rater reliability) or random (inter-rater) single-measure Intra-class Correlation Coefficients 26 in addition to the Pearson correlation coefficient (“r”). Furthermore, the mean error and its 95% confidence interval (CI), the standard deviation (SD) and the root-mean-square-error (RMSE) were evaluated. The LM was examined with overall accuracy and Cohen’s kappa 27 with quadratic weights as a measure of inter-rater agreement. ICC values and Cohen’s kappa above .75 were considered excellent.28,29 All statistical evaluations were performed in the Python 3 programming language. 30
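The agreement metrics named above can be written down compactly in plain Python. The sketch below is not the authors' analysis code. It uses the standard single-measure, absolute-agreement ICC formula based on two-way ANOVA mean squares and a quadratically weighted Cohen's kappa; the function names and the sign convention of the mean error (second series minus first) are assumptions.

```python
import math

def icc_absolute_single(ratings):
    """Single-measure, absolute-agreement ICC from a subjects x raters table."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    msr = ss_rows / (n - 1)                      # between-subject mean square
    msc = ss_cols / (k - 1)                      # between-rater mean square
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def mean_error(a, b):
    """Mean signed difference b - a (sign convention is an assumption)."""
    return sum(y - x for x, y in zip(a, b)) / len(a)

def rmse(a, b):
    """Root-mean-square error between two measurement series."""
    return math.sqrt(sum((y - x) ** 2 for x, y in zip(a, b)) / len(a))

def quadratic_weighted_kappa(r1, r2, categories):
    """Cohen's kappa with quadratic weights for ordered categories."""
    idx = {c: i for i, c in enumerate(categories)}
    m, n = len(categories), len(r1)
    observed = [[0.0] * m for _ in range(m)]
    for a, b in zip(r1, r2):
        observed[idx[a]][idx[b]] += 1
    row = [sum(observed[i]) for i in range(m)]
    col = [sum(observed[i][j] for i in range(m)) for j in range(m)]
    num = den = 0.0
    for i in range(m):
        for j in range(m):
            w = (i - j) ** 2 / (m - 1) ** 2
            num += w * observed[i][j]            # weighted observed disagreement
            den += w * row[i] * col[j] / n       # weighted chance disagreement
    return 1.0 - num / den
```

Because the ICC is computed for absolute agreement, a constant offset between two raters lowers the coefficient even when their measurements are perfectly correlated, which is why it is reported alongside Pearson's r.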
Results
All 100 X-ray images could be evaluated successfully by human raters and AI, which allowed a comparison of 100 radiographic measurements for each radiographic parameter.
Table 1. Intra-rater reliability for rater 1A vs rater 1B (n = 100).
Table 2. Inter-rater reliability of rater 1A, rater 1B vs rater 2 (n = 100).
The inter-rater reliability analysis (shown here for rater 1A vs rater 2 as an example) shows the highest agreement for CB (ICC: .99), the lowest agreement for C-PT (ICC: .85) and a Cohen’s kappa of .87 for LM. Mean errors and RMSEs are lowest for CA (mean error: .0°, RMSE: .6°) and highest for C-PT (mean error: 5.5°, RMSE: 7.2°).
Table 3. Inter-rater reliability of rater 1A, 1B and rater 2 vs the algorithm (n = 100).

Scatterplots displaying, as an example, the correlation between the measurements of rater 1A and the AI-algorithm for all 100 evaluations.
Discussion
The reliable evaluation of AP X-ray images is essential for the diagnosis, surgical planning, and post-interventional assessment of AIS patients. The presented results confirm the study hypothesis and demonstrate that the novel AI-based algorithm is able to compute essential preoperative radiological parameters with excellent ICC values (>.75;27,28) and RMSE values similar to those of experienced physicians. The reliability and speed with which AI is able to conduct measurements support its application in clinical routine as well as the analysis of large datasets for research purposes.
Previous studies have favored Cobb angles as the primary object of investigation for determining the reliability of AI-generated and human-generated measurements. For example, the investigation by Pan et al 13 on the reliability of a Mask R-CNN model against manual measurements by radiologists yielded excellent ICC values of .85. However, their model did not differentiate between primary and secondary Cobb angles, recording only 14 cases of double-curve scoliosis. A further limitation was the misclassification of 3 scoliotic curves as 1 single combined curve. In line with the present study, Galbusera et al 24 also used EOS images. However, their evaluation of a single Cobb angle (the most severe curve) relied on a smaller, 50-patient cohort, and Cobb angle measurements were based on pre-defined end vertebrae rather than end vertebrae identified independently by humans and the AI-algorithm. Furthermore, they did not evaluate intra- and inter-rater reliability between different human raters (1 rater approach), and quantified agreement between human assessment and AI with a Pearson’s correlation coefficient value (r = .83) and not with an ICC. 24 Sun et al presented a promising CNN-based approach to automatically determine the Cobb angles on a smaller cohort of 36 images, resulting in ICC values of .99. 31 However, the specific type of ICC used for the analysis was not reported. Horng et al 25 investigated the automatic measurement of radiographs of the whole spine in a smaller sample of 35 images. The ICCs for Cobb angles were above .93, similar to the present study for C-T. 25 The application of different statistical approaches (eg, no usage of ICCs as recommended or the unclear definition of the used ICC model (absolute agreement vs consistency)) and a different measurement methodology complicate a rigorous comparison in the scientific literature. The present study addresses these limitations in 3 ways.
First, it conducts an in-depth intra- and inter-rater reliability analysis using, in particular, ICCs and measurements of 2 independent physicians as ground truth values. Second, the study evaluates a comparatively large patient cohort of 100 patients. Lastly, it evaluates additional parameters relevant to the characterization of scoliotic deformities including a subdivision of Cobb angles. For all the selected parameters, ICC values for inter- and intra-rater reliability of human raters were between .85 and 1.0, substantiating the quality of the applied ground truths and facilitating a more rigorous statistical comparison than preceding studies. Based on this robust human-measured reference, the AI analysis resulted in excellent agreement (>.75;28,29).
Consistently, for the inter-rater reliability between both human raters and between humans and AI-algorithm, the highest ICC values were typically calculated for the parameters CA and CB (ICCs between .95 and .99; Tables 2 and 3). The high reliability between human raters and between physicians and AI for these parameters is due to the visibility and identifiability of the requisite anatomical entities (clavicles, C7 vertebra, sacrum) in the images. The smallest ICC values were found for the C-PT (ICCs inter-rater reliability humans: .85-.88, Table 2; AI vs human: .78-.84, Table 3), both when comparing results between physicians and between AI and human raters. As demonstrated by Goldberg et al, 9 the proximal curve is often short and not very pronounced, complicating an unambiguous identification of each end vertebra of the scoliotic curves. In addition, in the proximal thoracic region, the proximity and/or potential superimposition of adjacent anatomical structures such as the scapulae, clavicles, ribs, sternum, and the mediastinum make the identification of anatomical entities more difficult and thus impede measurements. Therefore, standardized and deterministic AI-based analyses could in the future also enable the efficient, consistent assessment of even more demanding measurements in day-to-day clinical practice.
Despite the advances the study makes in using a fully automatic method for the reliable determination of several coronal parameters, it has some limitations. The AI-algorithm was evaluated to measure exclusively coronal parameters in this study, although parameters of sagittal spinal balance are also instrumental in the treatment of AIS patients. In fact, the algorithm has already been proven to reliably determine sagittal parameters in mainly non-scoliotic patients.11,32 However, a validation study on sagittal parameters in AIS patients with pronounced coronal deformities is more challenging and necessitates further investigation in a future study. A further limitation of the current scientific literature and the current study is the analysis of exclusively preoperative images. Spinal implants (eg, cages, screws) in postoperative images can obscure spinal bony structures and thus might complicate automated analysis, which requires further development and validation. Furthermore, the study relied on a mono-centric approach, evaluating EOS images from a single clinical site. Future studies should source images from more spine centers with images from different X-ray machines and a more heterogeneous patient cohort (eg, de novo and other secondary scoliosis or spinal deformities) to account for the diversity of clinical routine.
In conclusion, the study thoroughly evaluated a newly developed AI-based algorithm for coronal radiographic parameters that relieves physicians of time-consuming routine work and error-prone measurements, thereby allowing the analysis of large datasets for research purposes (eg, in large registry studies) that ultimately improve the quality of care for patients suffering from spinal deformities.
Footnotes
Acknowledgments
The authors greatly appreciate the support of the AO Spine International (AO Spine Start-up Grant) for this research project. The authors would also like to thank Preston Melchert for his excellent editing and proofreading of the manuscript.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
