
Abstract

Study Design

Retrospective, mono-centric cohort research study.

Objectives

The analysis of cervical sagittal balance parameters is essential for preoperative planning but depends on the physician’s experience. A fully automated artificial intelligence (AI)-based algorithm could contribute to an objective analysis and save time. The aim of this study was therefore to validate such an algorithm.

Methods

Two surgeons measured C2-C7 lordosis, C1-C7 Sagittal Vertical Axis (SVA), C2-C7-SVA, C7-slope and T1-slope in pre- and postoperative lateral cervical X-rays of 129 patients undergoing anterior cervical surgery. All parameters were measured twice by surgeons and compared to the measurements by the AI algorithm consisting of 4 deep convolutional neural networks. Agreement between raters was quantified, among other metrics, by mean errors and single measure intraclass correlation coefficients for absolute agreement.

Results

ICC-values for intra- (range: .92-1.0) and inter-rater (.91-1.0) reliability reflect excellent agreement between human raters. The AI algorithm could determine all parameters with excellent ICC-values (preop: .80-1.0; postop: .86-.99). For a comparison between the AI algorithm and 1 surgeon, mean errors were smallest for C1-C7 SVA (preop: −.3 mm (95% CI: −.6 to −.1 mm), postop: .3 mm (.0 to .7 mm)) and largest for C2-C7 lordosis (preop: −2.2° (−2.9 to −1.6°), postop: −2.3° (−3.0 to −1.7°)). The automatic measurement was possible in 99% and 98% of pre- and postoperative images, respectively, for all parameters except T1 slope, which had a detection rate of 48% and 51% in pre- and postoperative images.

Conclusion

This study validates that an AI-algorithm can reliably measure cervical sagittal balance parameters automatically in patients suffering from degenerative spinal diseases. It may simplify manual measurements and autonomously analyze large-scale datasets. Further studies are required to validate the algorithm on a larger and more diverse patient cohort.

Keywords

artificial intelligence, deep learning, sagittal balance, automatic analysis, X-ray, cervical spine

Introduction

Neck pain is increasingly common worldwide¹ and is the fourth leading cause of work disability,² resulting in high costs for the health care system. The sagittal profile of the cervical spine plays a central role in the balance of the head and the alignment of the horizontal field of vision.³ In the last decade, research focusing on sagittal alignment of the cervical spine has increased rapidly.⁴ Several studies have shown an impact of kyphotic imbalance on postoperative outcome of cervical myelopathy,^5,6 and malalignment may affect the etiology of adjacent segment pathology.⁷ Furthermore, the correlation of the sagittal profile with health-related quality of life has been discussed,⁸ emphasizing its importance for surgical planning, selection of the surgical approach, and postoperative outcome.⁹

The assessment of cervical sagittal alignment has become an essential part of routine clinical practice and scientific research. Analysis of cervical alignment parameters is important for preoperative planning and postoperative assessment since the goal of surgery is to restore the physiological sagittal balance. There are many radiological parameters to be assessed in lateral cervical X-ray images,³ which are typically measured manually and individually in clinical routine. The current manual measurement technique for sagittal balance analysis is time-consuming, occasionally inaccurate, and heavily dependent on the experience of the measuring physician.^10,11 The use of a fully automated artificial intelligence (AI)-based algorithm to determine sagittal balance parameters could save time in clinical practice, contribute to an objective analysis of pathologies, and therefore improve preoperative planning and postoperative failure recognition.

In the past few years, several AI methods have been proposed for the automation of spinal analysis, including algorithms for the measurement of clinically relevant parameters from plain X-rays.¹² However, most of these studies used full spine X-rays^12-16 or lumbar spine X-rays,^17,18 focusing on global sagittal alignment parameters. Yeh et al.¹⁵ used an AI algorithm to automatically measure 3 cervical spine parameters (cervical lordosis, cervical sagittal vertical axis (SVA) and T1 slope) in full-spine lateral X-rays, yet all reported mean errors lay above 5.3° and 1.1 mm. To the best of our knowledge, no algorithm for the automated analysis of sagittal balance parameters in pre- and postoperative cervical lateral X-rays has been proposed and thoroughly validated.

Furthermore, intervertebral cages or disk replacements, screws and other instruments encountered in clinical routine complicate the automatic (and manual) measurement of postoperative cervical radiographic parameters. However, given the relationship between postoperative cervical alignment parameters and postoperative outcome,^8,19 it is essential to analyze postoperative images with comparable measurement accuracy to preoperative images. Previous automation studies for radiographic parameters are typically limited to preoperative images and thus lack validation for the use of these algorithms on postoperative images.

Therefore, the aim of the present study is to create a new, fully automated method for determining the parameters of cervical sagittal balance pre- and postoperatively and to validate it against independent measurements of physicians. We hypothesize that an AI-based algorithm can systematically determine radiological parameters of sagittal cervical balance with high reliability and accuracy in comparison to measurements of experienced physicians in lateral cervical X-ray images.

Methods

Patient Population and Patient Selection

For this retrospective single-center study, all patients who were surgically treated via an anterior surgical approach between 2010 and 2020 were screened. Patients were selected from a clinical database using OPS (official German classification for operations, procedures and general medical measures) code 5-030.7 (access to cervical spine from ventral). The complete selection procedure is visualized in Figure 1. Initially, a total of 244 patients were found. Of these, only patients who had undergone elective anterior cervical discectomy and fusion (ACDF) or cervical disk arthroplasty (CDA) because of degenerative conditions of the cervical spine were selected. Only patients for whom the upper endplate of T1 could be identified in both images (pre- and postoperative) were included. Patients who had undergone previous cervical spine surgery, suffered from fractures (eg, treated with a vertebral body replacement) or spinal infection, or had additional posterior surgical access were excluded. This led to a final sample size of 129 patients.

Figure 1.

Flowchart visualizing the inclusion criteria and selection process.

The present study was reviewed and approved by the ethics review committee of the university hospital of Jena, Germany, under registry number 2020-1946. Owing to the retrospective and anonymized study design, informed consent was not required according to the ethics committee approval.

Image Acquisition and Measured Cervical Parameters

For each patient, lateral cervical X-rays were taken in neutral standing position before and after surgery. All images were anonymized before analysis. The following cervical parameters were evaluated on all pre- and postoperative images:

• C1–C7 SVA: distance (in mm) between the posterior superior corner of C7 and the plumbline from the C1 anterior tubercle. The distance was defined to be positive if the C1 anterior tubercle is anterior to the posterior superior corner of C7 and negative otherwise.

• C2–C7 SVA: distance (in mm) between the posterior superior corner of C7 and the plumbline from the centroid of C2. The distance was defined to be positive if the C2 centroid is anterior to the posterior superior corner of C7 and negative otherwise.

• C2–C7 lordosis: angle (in degrees) between the lower endplate of C2 and the lower endplate of C7, with the sign indicating lordosis (+) and kyphosis (−).

• C7 slope: angle (in degrees) between a horizontal line and the superior endplate of C7.

• T1 slope: angle (in degrees) between a horizontal line and the superior endplate of T1.

The pixel spacing information saved in the DICOM tags was used to convert the distance from pixel space to mm. An exemplary radiographic measurement by the proposed AI algorithm is shown in Figure 2 (“Output”), which is consistent with the human measurements.
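As a minimal sketch of the two SVA definitions and the pixel-to-millimetre conversion described above, the following Python function computes a signed horizontal offset between a reference landmark and the posterior superior corner of C7. It is an illustration, not the implementation used in this study. The function name, the `anterior_is_positive_x` flag, and the choice of the column (horizontal) spacing are assumptions; in DICOM, the Pixel Spacing attribute lists the row spacing first and the column spacing second.

```python
def sva_mm(ref_xy, c7_corner_xy, column_spacing_mm, anterior_is_positive_x=True):
    """Signed horizontal distance (mm) between a plumbline and the C7 corner.

    ref_xy: (x, y) pixel position of the C1 anterior tubercle or the C2 centroid.
    c7_corner_xy: (x, y) pixel position of the posterior superior corner of C7.
    The plumbline is vertical, so only the x-coordinates enter the result.
    """
    dx_px = ref_xy[0] - c7_corner_xy[0]
    if not anterior_is_positive_x:  # assumed flag: patient faces the -x direction
        dx_px = -dx_px
    return dx_px * column_spacing_mm


# Hypothetical landmarks: reference point 20 px anterior to the C7 corner,
# with an assumed column spacing of 0.2 mm per pixel.
print(sva_mm((120, 40), (100, 300), column_spacing_mm=0.2))
```

The result is positive when the reference point lies anterior to the C7 corner and negative otherwise, matching the sign convention in the definitions above.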

Figure 2.

Workflow of the fully automated AI algorithm. The image is preprocessed before serving as input to the segmentation network. Crops of C1, C2, C7, and T1 are generated based on the segmentation masks. Landmarks are predicted on these cropped regions of interest to fit lines and measure 5 cervical balance parameters.

Manual Measurements by Physicians

Measurements were performed with the free and open-source software Horos Viewer (Nimble Co LLC, Annapolis, MD, USA) on all preoperative and postoperative images by 2 experienced physicians (rater 1 and 2). Both raters were extensively trained in radiographic angle determination and perform measurements in daily routine. All measurements were completed twice at different time points to assess intra-rater reliability in addition to inter-rater reliability. All measurements were performed in a completely independent and blinded manner. Raters had access to magnification, contrast, and brightness adjustment tools. Manual raters used the above-mentioned definitions of the cervical parameters for measurement. In cases of degeneration, osteophytes affected the detection of endplates. In these cases, osteophytes were excluded when identifying the endplate by extrapolating the points where the endplate and the anterior or posterior wall intersect.

Automated Measurements by the AI Algorithm

A fully automatic AI algorithm was developed to determine the above-mentioned sagittal balance parameters on lateral X-rays (C1-C7 SVA, C2-C7 SVA, C2-C7 lordosis, C7 slope, T1 slope). The algorithm consists of 4 interlinked deep convolutional neural networks (CNNs) that were trained on anonymized lateral cervical spine X-ray images. The training images were completely independent of the 258 evaluation images (pre- and postoperative images from 129 patients), to demonstrate unbiased performance of the algorithm during final validation. The fully automated pipeline, starting with the input of an X-ray image and resulting in the output of the visualized cervical sagittal balance parameters, is displayed in Figure 2.

Preprocessing

First, the input DICOM images were anonymized and converted to plain images. During conversion, each DICOM image was normalized by making use of the window width and window center information to adjust contrast and brightness, such that the vertebral bodies, spinous processes and various types of implants were clearly visible.
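The windowing step can be illustrated with a short sketch. It assumes the pixel data are already available as a NumPy array and that the window center and width have been read from the DICOM header; the function name and the 8-bit output range are assumptions, and the formula is a simplified linear window rather than the exact function defined in the DICOM standard.

```python
import numpy as np


def apply_window(pixels, center, width):
    """Clip intensities to the window and rescale them linearly to 0..255."""
    lo = center - width / 2.0
    hi = center + width / 2.0
    clipped = np.clip(pixels.astype(np.float64), lo, hi)
    scaled = (clipped - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)


# Hypothetical 1 x 4 image with an assumed window of center 1500, width 1000.
raw = np.array([[0, 1000, 2000, 3000]])
print(apply_window(raw, center=1500, width=1000))
```

Values below the window map to black and values above it to white, so the contrast of the remaining intensity range is stretched over the full output range.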

Segmentation

Second, all visible spinal structures were detected and segmented to localize the cervical vertebral bodies in the image. For this, a CNN architecture with a Mask R-CNN backbone for instance segmentation was used.²⁰ The task of the network was to locate and classify the cervical vertebrae C1-C7 and, if visible, the first thoracic vertebral body T1. If present, intervertebral cages and artificial disks were segmented as separate categories.

Manual training annotations consisted of bounding polygon masks around the respective structures and were conducted by trained medical personnel. The training set for the segmentation model comprised 519 images (175 preoperative, 334 postoperative) from 18 different clinical sites. To achieve generalizability, it included images with various types of implants, as well as images in neutral, flexion, and extension position. Data augmentation was applied by horizontal flipping of half of the training images. Ten percent of the training images were randomly selected as a test set to monitor the loss over the course of training. The segmentation network and workflow were implemented in the PyTorch framework.²¹ Training ran on an NVIDIA GeForce GTX 1080 GPU for 100 epochs with a constant learning rate of .0012. Before training, the network was initialized with pre-trained weights from the publicly available PyTorch model zoo (https://pytorch.org/serve/model_zoo.html).
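The random 10% hold-out and the horizontal flip of polygon annotations can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the fixed random seed, and the representation of a polygon as a list of (x, y) pixel coordinates are hypothetical and not taken from the study.

```python
import random


def split_holdout(items, holdout_fraction=0.10, seed=0):
    """Randomly set aside a fraction of the items to monitor the loss."""
    rng = random.Random(seed)  # assumed fixed seed for reproducibility
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def flip_polygon_horizontally(polygon, image_width):
    """Mirror polygon vertices so they match a horizontally flipped image."""
    return [(image_width - 1 - x, y) for x, y in polygon]


train_ids, holdout_ids = split_holdout(range(519))
print(len(train_ids), len(holdout_ids))
print(flip_polygon_horizontally([(0, 5), (9, 7)], image_width=10))
```

When an image is flipped, its annotation must be flipped with the same transform; otherwise the masks no longer cover the structures they label.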

During inference, each vertebral body C1-T1 was allowed to appear at most once. If the same label was predicted multiple times, the mask with the highest confidence score was selected as the correct prediction while the other masks were discarded. In general, masks with a confidence score below 55% were discarded as erroneous predictions.
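The inference-time filtering described above could look roughly like the sketch below. The dictionary layout of a detection (`label`, `score`) and the function name are assumptions made for illustration; only the 55% threshold and the rule of keeping the highest-scoring mask per label come from the text.

```python
def select_detections(detections, min_score=0.55):
    """Keep at most one detection per label: the best one above the threshold."""
    best = {}
    for det in detections:
        if det["score"] < min_score:
            continue  # discarded as an erroneous prediction
        current = best.get(det["label"])
        if current is None or det["score"] > current["score"]:
            best[det["label"]] = det
    return best


# Hypothetical network output with a duplicate C7 and a weak T1 candidate.
raw = [
    {"label": "C7", "score": 0.91},
    {"label": "C7", "score": 0.62},
    {"label": "T1", "score": 0.40},
    {"label": "C2", "score": 0.88},
]
print(sorted(select_detections(raw)))
```

In this example the weak T1 candidate is dropped, which mirrors how a missing T1 detection prevents the computation of the T1 slope for an image.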

Landmark Detection

Next, the segmented structures were used to determine regions of interest: rectangular crops around the segmented vertebral bodies were automatically created and served as input to 3 different landmark detection models. The task for the C1 landmark detection model was to predict 1 landmark on the anterior tubercle (see Figure 2 “landmark placement”). The C2 landmark detection model was trained to predict 3 landmarks on the inferior endplate, generally neglecting osteophytes of the C2 vertebral body. The location of a fourth landmark on C2 was determined by computing the centroid of the C2 segmentation mask. The third landmark detection model was trained to predict 3 landmarks on each of the superior and inferior endplates of vertebrae C3-T1. Three separate landmark detection models were thus trained to account for the different numbers of landmarks to be predicted on the vertebral bodies.
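A rough sketch of how a rectangular crop and the C2 centroid can be derived from a binary segmentation mask is given below. The function names and the margin around the mask are assumptions; the text does not state how much context was included around each vertebral body.

```python
import numpy as np


def crop_box(mask, margin=10):
    """Bounding box (x0, y0, x1, y1) around a binary mask, clipped to the image."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty")
    height, width = mask.shape
    x0 = max(int(xs.min()) - margin, 0)
    y0 = max(int(ys.min()) - margin, 0)
    x1 = min(int(xs.max()) + 1 + margin, width)   # exclusive upper bound
    y1 = min(int(ys.max()) + 1 + margin, height)  # exclusive upper bound
    return x0, y0, x1, y1


def mask_centroid(mask):
    """Centroid (x, y) of all foreground pixels of a binary mask."""
    ys, xs = np.nonzero(mask)
    return float(xs.mean()), float(ys.mean())


# Hypothetical 20 x 30 mask with a small rectangular "vertebral body".
mask = np.zeros((20, 30), dtype=np.uint8)
mask[5:9, 10:16] = 1
print(crop_box(mask, margin=2), mask_centroid(mask))
```

The crop is then obtained with `image[y0:y1, x0:x1]`, and landmarks predicted inside the crop are shifted back by `(x0, y0)` to return to full-image coordinates.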

To create training material, trained medical personnel manually annotated lateral cervical spine X-ray images by placing landmarks on all visible vertebral bodies in a custom measurement interface implemented for this purpose. The C1 landmark detection model was trained on 206 crops (87 preoperative, 119 postoperative) from 12 different sites. The C2 landmark detection model was trained on 200 crops (82 preoperative, 118 postoperative) from 12 different sites. The C3-T1 landmark detection model was trained on crops of all detected C3, C4, C5, C6, C7, and T1 vertebral bodies. This amounted to a training set of 1042 crops detected in 232 images (94 preoperative, 138 postoperative) from 7 different sites. During training of each model, 10% of crops were randomly selected as a validation set to monitor the loss after each training epoch. Training crops were randomly rotated and/or scaled by random factors for the purpose of data augmentation. Before serving as model input, the images were squared and downscaled to 256 × 256 pixels. The CNN architecture, based on U-Net,²² was adapted with the final layer providing landmarks. It was implemented using the TensorFlow framework²³ and training ran on an NVIDIA GeForce GTX 1080 GPU for 100 epochs.
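Because landmarks are predicted on squared, downscaled crops, they have to be mapped back to the original crop before any measurement. The sketch below shows one way to do this, assuming the crop is padded symmetrically to a square before resizing; the text does not specify how the images were squared, so the padding scheme and function names are assumptions.

```python
def crop_to_model(point, crop_width, crop_height, target=256):
    """Map a landmark from crop pixels to the squared, resized model input."""
    side = max(crop_width, crop_height)
    scale = target / side
    pad_x = (side - crop_width) / 2.0   # assumed symmetric padding
    pad_y = (side - crop_height) / 2.0
    return (point[0] + pad_x) * scale, (point[1] + pad_y) * scale


def model_to_crop(point, crop_width, crop_height, target=256):
    """Inverse mapping: from model-input pixels back to crop pixels."""
    side = max(crop_width, crop_height)
    scale = target / side
    pad_x = (side - crop_width) / 2.0
    pad_y = (side - crop_height) / 2.0
    return point[0] / scale - pad_x, point[1] / scale - pad_y


# Hypothetical 100 x 200 crop: a landmark is mapped forward and back again.
forward = crop_to_model((30, 120), 100, 200)
print(forward, model_to_crop(forward, 100, 200))
```

Keeping both directions in one place makes it easy to check that a round trip returns the original coordinates.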

The training data set contains various indications and surgical instrumentations (eg, multilevel fusion surgeries with and without posterior instrumentation, cages, plates, total disc replacements, etc.). Furthermore, the presented evaluation images were not part of the training images used. Thus, the presented evaluation is technically independent from the training material, in particular to reveal and overcome potential problems of under- or overfitting.

Parameter Determination

As the final step, the predicted segmentation masks and landmarks were used to compute all cervical balance parameters by elementary geometric calculations in accordance with the above-mentioned definitions.
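The angular parameters can be illustrated with a small sketch that fits a line through the endplate landmarks and measures its inclination. The least-squares fit, the function names, and the sign handling are assumptions for illustration; in image coordinates the y-axis points downward, so the sign of the result depends on the orientation of the patient in the image and has to be mapped to the lordosis (+) and kyphosis (−) convention separately.

```python
import math


def endplate_angle_deg(landmarks):
    """Angle (degrees) between the image x-axis and the least-squares line
    through the given (x, y) endplate landmarks."""
    n = len(landmarks)
    mean_x = sum(x for x, _ in landmarks) / n
    mean_y = sum(y for _, y in landmarks) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in landmarks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in landmarks)
    # sxx > 0 for a non-vertical endplate, so atan2 yields the slope angle.
    return math.degrees(math.atan2(sxy, sxx))


def angle_between_deg(upper_landmarks, lower_landmarks):
    """Signed angle between two fitted endplate lines, e.g. C2 and C7 inferior."""
    return endplate_angle_deg(lower_landmarks) - endplate_angle_deg(upper_landmarks)


# Hypothetical landmarks: a horizontal C2 endplate and a tilted C7 endplate.
c2_inferior = [(0, 5), (10, 5), (20, 5)]
c7_inferior = [(0, 0), (10, 10), (20, 20)]
print(endplate_angle_deg(c7_inferior), angle_between_deg(c2_inferior, c7_inferior))
```

With this helper, the C7 and T1 slopes correspond to `endplate_angle_deg` applied to the respective superior endplate, and C2-C7 lordosis to `angle_between_deg` applied to the two inferior endplates.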

Statistical Analysis

Multiple statistical evaluation metrics were computed to allow an in-depth comparison to previous studies. The same metrics were computed for comparisons between 2 human raters (R1 and R2) and for comparisons between human raters and the AI algorithm (R-AI). The first (R1a, R2a) and second measurements (R1b, R2b) of each human rater are referred to by letters.

To determine accuracy, mean differences between measurements were computed along with their standard deviations (STD) and their 95% confidence intervals (CI). Furthermore, root-mean-square errors (RMSE) were calculated. Inter- and intra-rater reliability were assessed using two-way mixed or random single-measure intra-class correlation coefficients (ICC) for absolute agreement, respectively.^24-26 In accordance with Cicchetti,²⁷ ICC-values larger than .75 were considered excellent. Pearson’s correlation coefficient r served as an additional metric for agreement. All statistical evaluations were performed in the Python 3 programming language.²⁸
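The agreement metrics named above can be sketched in plain Python as follows. This is not the evaluation code of the study: the function names are hypothetical, the 95% CI uses a normal approximation (mean ± 1.96 · STD / √n), and the ICC follows the standard two-way ANOVA formula for single-measure absolute agreement, which is the same for the random and the mixed model.

```python
import math


def agreement_summary(a, b):
    """Mean difference, sample STD, approximate 95% CI, and RMSE of a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    std = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    half_width = 1.96 * std / math.sqrt(n)  # assumed normal approximation
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    return {"mean": mean, "std": std,
            "ci95": (mean - half_width, mean + half_width), "rmse": rmse}


def icc_a1(ratings):
    """Two-way, absolute-agreement, single-measure ICC.

    ratings: one row per image, each row holding one value per rater.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)


# Hypothetical angles (degrees) from two raters on four images.
rater_a = [10.0, 12.0, 15.0, 20.0]
rater_b = [10.0, 11.0, 15.0, 19.0]
print(agreement_summary(rater_a, rater_b))
print(icc_a1([list(pair) for pair in zip(rater_a, rater_b)]))
```

Because RMSE² is approximately the squared mean error plus the squared STD, a systematic offset between two raters inflates the RMSE even when the spread of the differences is small.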

Results

Patient Cohort Statistics

Preoperative and postoperative sagittal cervical spine radiographs of 129 patients were included in the sample, consisting of 84 women and 45 men with a mean age of 53 years. There were 59 mono-segmental, 56 bi-segmental, 13 three-segmental and 1 four-segmental surgeries. Forty-seven patients underwent instrumentation with a PEEK cage, and in 82 cases titanium implants were inserted (55 patients with titanium cages and 27 patients with CDA). Patients who underwent multi-segmental instrumentation received the same implant in every segment. Figure 3 displays the different types of implants used.

Figure 3.

Absolute numbers of different implant types used for anterior cervical discectomy and fusion or cervical disk arthroplasty in the present patient cohort.

Human vs Human Intra- and Inter-Rater Reliability

Tables 1 and 2 display the results of the analysis of intra- and inter-rater reliability between human raters, respectively.

Table 1.

Intra-Rater Reliability of the Manual Measurements; Rater 1a vs Rater 1b; Rater 2a vs Rater 2b; n = 129.

	Statistical method	C2-C7 lordosis (°)	C1-C7 SVA¹ (mm)	C2-C7 SVA (mm)	C7 slope (°)	T1 slope (°)
Intra-rater Reliability (Rater 1a vs Rater 1b)
Preop	ICC² (95% CI³)	.99 (.98-.99)	.96 (.94-.97)	.99 (.98-.99)	.97 (.94-.98)	.97 (.95-.98)
	Mean error (95% CI)	−.5 (−.9–-.2)	.9 (.2-1.6)	−.4 (−.7–-.1)	.9 (.5-1.3)	.6 (.2-1.0)
	STD⁴	1.9	4.2	1.8	2.2	2.2
	RMSE⁵	2.0	4.2	1.8	2.4	2.3
Postop	ICC (95% CI)	.98 (.97-.99)	1.0 (1.0-1.0)	1.0 (1.0-1.0)	.96 (.95-.97)	.92 (.89-.94)
	Mean error (95% CI)	.2 (−.2-.7)	.2 (.1-.4)	.2 (.1-.4)	−.1 (−.5-.3)	.6 (.0-1.2)
	STD	2.4	0.9	0.6	2.3	3.5
	RMSE	2.4	1.0	0.7	2.3	3.6
Intra-rater Reliability (Rater 2a vs Rater 2b)
Preop	ICC (95% CI)	.98 (.97-.98)	.99 (.99-1.0)	.98 (.98-.99)	.96 (.94-.97)	.92 (.90-.95)
	Mean error (95% CI)	−.1 (−.7-.4)	.1 (−.2-.4)	−.1 (−.5-.3)	.5 (.1-1.0)	.0 (−.6-.7)
	STD	3.1	1.8	2.2	2.6	3.6
	RMSE	3.1	1.8	2.2	2.6	3.6
Postop	ICC (95% CI)	.98 (.97-.98)	.99 (.99-1.0)	.99 (.99-1.0)	.95 (.92-.96)	.95 (.92-.96)
	Mean error (95% CI)	.2 (−.2-.7)	.1 (−.0-.3)	.3 (.1-.5)	.5 (−.1-1.0)	.4 (−.1-.9)
	STD	2.7	0.9	1.1	1.9	2.9
	RMSE	2.7	0.9	1.1	1.9	2.9

Table 2.

Inter-rater Reliability of the Manual Measurements; Rater 1a vs Rater 2a; Rater 1b vs Rater 2b; n = 129.

	Statistical method	C2-C7 lordosis (°)	C1-C7 SVA⁶ (mm)	C2-C7 SVA (mm)	C7 slope (°)	T1 slope (°)
Inter-rater Reliability (Rater 1a vs Rater 2a)
Preop	ICC⁷ (95% CI⁸)	.96 (.94-.97)	.96 (.94-.97)	.99 (.98-.99)	.93 (.85-.96)	.91 (.85-.94)
	Mean error (95% CI)	.1 (−.6-.8)	.5 (−.2-1.3)	.2 (−.1-.6)	1.7 (1.2-2.3)	1.5 (.9-2.2)
	STD⁹	3.9	4.2	2.0	3.2	3.7
	RMSE¹⁰	3.9	4.2	2.1	3.7	4.0
Postop	ICC (95% CI)	.95 (.93-.97)	.99 (.99-1.0)	.99 (.99-1.0)	.95 (.92-.97)	.91 (.88-.94)
	Mean error (95% CI)	1.1 (.4-1.7)	−.7 (−.9–-.5)	−.2 (−.4-.0)	.9 (.5-1.4)	.5 (−.1-1.2)
	STD	3.6	1.2	1.2	2.6	3.7
	RMSE	3.8	1.4	1.2	2.7	3.8
Inter-rater Reliability (Rater 1b vs Rater 2b)
Preop	ICC (95% CI)	.97 (.95-.98)	1.0 (.99-1.0)	.98 (.95-.98)	.95 (.91-.97)	.93 (.90-.95)
	Mean error (95% CI)	.5 (−.1-1.1)	−.2 (−.5–-.0)	.5 (.1-.9)	1.3 (.9-1.8)	.9 (.3-1.5)
	STD	3.5	1.4	2.2	2.5	3.3
	RMSE	3.5	1.4	2.2	2.9	3.4
Postop	ICC (95% CI)	.96 (.95-.98)	.99 (.98-1.0)	.99 (.99-1.0)	.95 (.91-.97)	.92 (.89-.94)
	Mean error (95% CI)	1.1 (.5-1.6)	−.8 (−1.0–-.6)	−.2 (−.4–-.0)	1.1 (.6-1.5)	.4 (−.2-1.0)
	STD	3.1	1.2	1.1	2.7	3.5
	RMSE	3.3	1.5	1.1	2.9	3.5

Agreement within raters (intra-rater reliability) was highest for C2-C7 SVA (ICC_PreOP = .99, ICC_PostOP = 1.0 for R1a vs R1b; ICC_PreOP = .98, ICC_PostOP = .99 for R2a vs R2b) and lowest for T1 slope (ICC_PreOP = .97, ICC_PostOP = .92 for R1a vs R1b; ICC_PreOP = .92, ICC_PostOP = .95 for R2a vs R2b). Pre- and postoperative RMSEs ranged between 2.0-3.1° for C2-C7 lordosis, .9-4.2 mm for C1-C7-SVA, .7-2.2 mm for C2-C7 SVA, 1.9-2.6° for C7 slope, and 2.3-3.6° for T1 slope.

Consistently, agreement between raters (inter-rater reliability) was highest for C2-C7 SVA (ICC_PreOP = ICC_PostOP = .99 for R1a vs R2a; ICC_PreOP = .98, ICC_PostOP = .99 for R1b vs R2b) and lowest for T1 slope (ICC_PreOP = ICC_PostOP = .91 for R1a vs R2a; ICC_PreOP = .93, ICC_PostOP = .92 for R1b vs R2b). The RMSEs were slightly larger between raters than within raters: they ranged between 3.3-3.9° for C2-C7 lordosis, 1.4-4.2 mm for C1-C7 SVA, 1.1-2.2 mm for C2-C7 SVA, 2.7-3.7° for C7 slope, and 3.4-4.0° for T1 slope.

Human vs AI Inter-Rater Reliability

The comparisons between the manual measurements and the automated measurements by the AI algorithm are shown in Table 3. The algorithm was able to detect C1, C2 and C7 and thereby compute the parameters C1-C7 SVA, C2-C7 SVA, C7 slope, and C2-C7 lordosis in 128 of 129 preoperative images (99%) and in 127 of 129 postoperative images (98%). In the 3 failed predictions, C7 was not detected by the segmentation network, preventing the computation of all parameters involving C7. The detection rate for T1 was lower: T1 slope could be computed in 48% of preoperative and 51% of postoperative images. The detailed evaluation of the segmentation model yielded a Dice score of .91 and .87 for pre- and postoperative images in the test dataset, respectively.
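For readers unfamiliar with the Dice score reported for the segmentation model, the following sketch shows how it is commonly computed for a pair of binary masks. The function name and the convention of returning 1.0 for two empty masks are assumptions; how the score was aggregated across structures and images in this study is not stated.

```python
import numpy as np


def dice_score(pred, truth):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)


# Hypothetical masks of 4 pixels each that share 2 pixels.
pred = [[1, 1, 0], [1, 1, 0]]
truth = [[0, 1, 1], [0, 1, 1]]
print(dice_score(pred, truth))
```

A score of 1 means perfect overlap between predicted and annotated mask, and a score of 0 means no overlap at all.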

Table 3.

Inter-rater Reliability Between AI Method and Manual Measurements; n = 128 for Preop, n = 127 for Postop C2-C7 Lordosis, C1-C7 SVA, C2-C7 SVA, and C7 Slope; n = 62 for Preop, n = 66 for Postop T1 Slope.

	Statistical method	C2-C7 lordosis (°)	C1-C7 SVA¹¹ (mm)	C2-C7 SVA (mm)	C7 slope (°)	T1 slope (°)
Inter-rater Reliability (AI Method vs Rater 1a)
Preop	ICC¹² (95% CI¹³)	.95 (.91-.97)	.96 (.94-.97)	.99 (.97-.99)	.95 (.93-.97)	.84 (.75-.90)
	Mean error (95% CI)	−1.7 (−2.4–-1.1)	−1.2 (−2.0–-.5)	.9 (.6-1.2)	.6 (.1-1.1)	.8 (−.1-1.6)
	STD¹⁴	3.7	4.2	1.8	2.7	3.4
	RMSE¹⁵	4.1	4.4	2.0	2.7	3.5
	Pearson’s r (P-value)	.96 (<.001)	.96 (<.001)	.99 (<.001)	.96 (<.001)	.85 (<.001)
Postop	ICC (95% CI)	.91 (.80-.95)	.99 (.99-.99)	.99 (.97-.99)	.93 (.90-.95)	.87 (.80-.92)
	Mean error (95% CI)	−2.6 (−3.3–-1.8)	.1 (−.2-.4)	.8 (.6-1.1)	.6 (.1-1.1)	.3 (−.5-1.1)
	STD	4.3	1.8	3.0	3.0	3.3
	RMSE	5.0	1.8	3.1	3.1	3.3
	Pearson’s r (P-value)	.94 (<.001)	.99 (<.001)	.99 (<.001)	.94 (<.001)	.87 (<.001)
Inter-rater reliability (AI method vs Rater 1b)
Preop	ICC (95% CI)	.95 (.87-.98)	1.0 (.99-1.0)	1.0 (.99-1.0)	.94 (.87-.97)	.85 (.76-.91)
	Mean error (95% CI)	−2.2 (−2.9–-1.6)	−.3 (−.6–-.1)	.5 (.3-.7)	1.5 (1.1-2.0)	1.0 (.2-1.9)
	STD	3.5	1.3	1.1	2.6	3.3
	RMSE	4.2	1.4	1.2	3.0	3.5
	Pearson’s r (P-value)	.96 (<.001)	1.0 (<.001)	1.0 (<.001)	.96 (<.001)	.87 (<.001)
Postop	ICC (95% CI)	.94 (.82-.97)	.99 (.98-.99)	.99 (.95-.99)	.94 (.92-.96)	.85 (.77-.91)
	Mean error (95% CI)	−2.3 (−3.0–-1.7)	.3 (.0-.7)	1.0 (.8-1.3)	.5 (0-1.0)	.4 (−.5-1.3)
	STD	3.5	2.0	1.4	2.7	3.7
	RMSE	4.2	2.0	1.7	2.8	3.7
	Pearson’s r (P-value)	.96 (<.001)	.99 (<.001)	.99 (<.001)	.94 (<.001)	.86 (<.001)
Inter-rater reliability (AI method vs Rater 2a)
Preop	ICC (95% CI)	.94 (.90-.96)	.99 (.99-1.0)	.99 (.95-1.0)	.89 (.72-.95)	.80 (.61-.89)
	Mean error (95% CI)	−1.6 (−2.4–-.8)	−.7 (−1.0–-.4)	1.1 (.9-1.	2.3 (1.7-3.0)	2.0 (1.0-2.9)
	STD	4.5	1.6	1.3	3.5	3.6
	RMSE	4.8	1.8	1.7	4.2	4.1
	Pearson’s r (P-value)	.95 (<.001)	.99 (<.001)	.99 (<.001)	.93 (<.001)	.84 (<.001)
Postop	ICC (95% CI)	.92 (.88-.95)	.99 (.98-.99)	.99 (.98-.99)	.93 (.84-.96)	.90 (.84-.94)
	Mean error (95% CI)	−1.6 (−2.3–-.8)	−.6 (−1.0–-.2)	.6 (.3-.9)	1.6 (1.1-2.1)	.5 (−.2-1.3)
	STD	4.4	2.1	1.7	2.8	3.1
	RMSE	4.6	2.2	1.8	3.2	3.1
	Pearson’s r (P-value)	.94 (<.001)	.99 (<.001)	.99 (<.001)	.94 (<.001)	.91 (<.001)
Inter-rater reliability (AI method vs Rater 2b)
Preop	ICC (95% CI)	.94 (.90-.96)	.99 (.99-1.0)	.98 (.96-.99)	.89 (.52-.96)	.82 (.68-.89)
	Mean error (95% CI)	−1.8 (−2.5–-1.0)	−.6 (−.8–-.3)	1.0 (.7-1.4)	2.8 (2.3-3.4)	1.6 (.7-2.6)
	STD	4.2	1.4	2.1	3.0	3.8
	RMSE	4.5	1.5	2.3	4.2	4.1
	Pearson’s r (P-value)	.95 (<.001)	1.0 (<.001)	.99 (<.001)	.94 (<.001)	.85 (<.001)
Postop	ICC (95% CI)	.94 (.91-.96)	.99 (.98-.99)	.99 (.97-.99)	.92 (.83-.96)	.86 (.78-.91)
	Mean error (95% CI)	−1.3 (−2.0–-.6)	−.5 (−.8-.1)	.9 (.6-1.1)	1.6 (1.1-2.1)	1.0 (.1-1.8)
	STD	4.0	2.0	1.5	2.8	3.5
	RMSE	4.2	2.0	1.8	3.2	3.6
	Pearson’s r (P-value)	.95 (<.001)	.99 (<.001)	.99 (<.001)	.94 (<.001)	.88 (<.001)

ICC-values ranged between .80 and 1.0 for preoperative images and between .85 and .99 for postoperative images. ICC-values were lowest for T1 slope (ICC_PreOP = .80, ICC_PostOP = .90 for R2a vs R-AI) and highest for C2-C7 SVA (ICC_PreOP = 1.0, ICC_PostOP = .99 for R1b vs R-AI). The RMSEs for preoperative images ranged between 4.1-4.8° for C2-C7 lordosis, 1.4-4.4 mm for C1-C7 SVA, 1.2-2.3 mm for C2-C7 SVA, 2.7-4.2° for C7 slope, and 3.5-4.1° for T1 slope. For postoperative images, RMSEs ranged between 4.2-5.0° for C2-C7 lordosis, 1.8-2.2 mm for C1-C7 SVA, 1.6-1.8 mm for C2-C7 SVA, 2.8-3.2° for C7 slope, and 3.1-3.7° for T1 slope. Figure 4 visualizes the correlations between measurements, using rater R2a and R-AI as an example.

Figure 4.

Comparisons of manual measurements by physician 2a vs automatic measurements of the AI algorithm for 129 pre- and postoperative images, respectively.

Discussion

The aim of the present study was to create a new, fully automated algorithm to determine the parameters of cervical sagittal balance pre- and postoperatively. It was shown that an AI-based algorithm can systematically determine radiological parameters of sagittal cervical balance with excellent reliability and accuracy in comparison to measurements of experienced physicians, confirming the hypothesis. To the best of our knowledge, this is the first study to validate an automatic measurement tool for pre- and postoperative cervical spine X-ray images.

Excellent Agreement Between Manual Measurements

Apart from the validation of a novel automatic algorithm, the present study offers valuable insights into the agreement between expert human raters on measurements of sagittal balance parameters in cervical X-ray images. Two experienced physicians independently measured all images twice. Interestingly, the overall results of the intra- and inter-rater reliability analysis resemble the comparison against the AI algorithm: the sagittal vertical axes were measured with higher agreement than C2-C7 lordosis and the slopes of the C7 or T1 vertebrae. The same was observed in the human vs human measurement comparison conducted on whole-spine radiographs by Yeh et al.¹⁵ Their results compare well with ours for C2-C7 SVA (present study: ICC_PreOP = .98-.99, ICC_PostOP = .99 vs ICC = .99-1.0 for the pooled pre- and postoperative evaluation of Yeh et al.¹⁵). Agreement between human raters was better for C2-C7 lordosis in the present study (ICC_PreOP = .96-.97, ICC_PostOP = .95-.96 vs ICC = .83-.88), and for T1 slope (ICC_PreOP = .91-.93, ICC_PostOP = .91-.92 vs ICC = .81-.86). Presumably, the restriction to the cervical spine region during imaging facilitated the measurement of cervical parameters by allowing higher resolution, leading to higher agreement between human raters in the present validation study. In Yeh et al.,¹⁵ the 3 mentioned cervical sagittal balance parameters were just a small subset of the spinal parameters measured in whole-spine radiographs. Furthermore, their dataset partially included scoliotic patients, potentially making measurements even more challenging.¹⁵

Excellent Agreement Between Manual and Automatic Measurements

After verifying the agreement of manual expert measurements by demonstrating the high agreement between the 2 human raters, we were able to assess the algorithm’s accuracy and reliability by comparing the automatic measurements against each set of manual measurements. Indeed, all 5 parameters could be measured with excellent agreement to manual measurements in pre- and postoperative images (all ICC-values ≥.8 for preoperative and ≥.86 for postoperative comparisons), with .75 defining the threshold for excellent reliability.²⁷ Even with a more conservative excellence threshold of >.9,²⁴ ICC-values for C2-C7 lordosis, C1-C7 SVA, and C2-C7 SVA still meet the excellence criterion, while ICC-values for C7 slope (range .89-.95) and T1 slope (range .80-.94) are considered good to excellent.
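The effect of the two thresholds can be made explicit with a small helper. Only the .75 and >.9 cut-offs for "excellent" are taken from the text; the remaining bands are assumptions based on commonly cited interpretation guidelines, and the function and scheme names are hypothetical.

```python
def interpret_icc(icc, scheme="cicchetti"):
    """Map an ICC value to a verbal label under one of two threshold schemes."""
    if scheme == "cicchetti":
        # .75 as the threshold for "excellent"; lower bands are assumed.
        for threshold, label in [(0.75, "excellent"), (0.60, "good"), (0.40, "fair")]:
            if icc >= threshold:
                return label
        return "poor"
    if scheme == "strict":
        # More conservative scheme with "excellent" only above .9 (bands assumed).
        if icc > 0.90:
            return "excellent"
        if icc >= 0.75:
            return "good"
        if icc >= 0.50:
            return "moderate"
        return "poor"
    raise ValueError(f"unknown scheme: {scheme}")


# The lowest preoperative ICC reported for T1 slope under both schemes.
print(interpret_icc(0.80), interpret_icc(0.80, scheme="strict"))
```

The same value can therefore be labelled "excellent" under one scheme and "good" under the other, which is why the threshold in use should always be reported alongside the ICC.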

To date, several fully automated algorithms have been proposed for the analysis of cervical X-ray images, but they do not measure parameters of the sagittal cervical balance. Amasya et al.²⁹ used an artificial neural network model to classify the maturation of cervical vertebrae. Al Arif et al.³⁰ proposed a segmentation algorithm to detect vertebral bodies from cervical X-rays without computing any parameters. Although our proposed algorithm also makes use of a segmentation network, the detection and localization of vertebral bodies and implants is only the first step in the full pipeline of automatic parameter determination. Lecron et al.³¹ conducted a cervical spine mobility analysis, where they measured the angles between the midplanes of C3/C4, C4/C5, C5/C6 and C6/C7. They reported a detection rate of 89.8% and a mean RMSE of 3.4° between manual and fully automatic measurements of midplane angles. Although the low errors are promising and the measurement of angles between segments is useful, eg, for the assessment of postoperative reconstruction of the operated segments, the study did not measure central global parameters of the cervical sagittal balance such as C7 slope or the SVA.³ On the other hand, Yeh et al.¹⁵ measured 3 global cervical parameters (C2-C7 SVA, C2-C7 lordosis, and T1 slope) as part of their analysis of pooled pre- and postoperative whole-spine X-ray images. Again, the results correspond well with the ones reported here, with our algorithm exceeding their performance for all parameters. In general, ICC-values were highest for C2-C7 SVA (present study: ICC_PreOP = .98-1.0 and ICC_PostOP = .99-1.0 vs ICC = .98¹⁵), and lower for C2-C7 lordosis (ICC_PreOP = .94-.95 and ICC_PostOP = .91-.94 vs ICC = .83) and T1 slope (ICC_PreOP = .80-.85 and ICC_PostOP = .86-.94 vs ICC = .77).

First-Time Validation on Preoperative vs Postoperative Images

Another novelty of the current study is the independent evaluation of the algorithm on pre- and postoperative images. Previous studies either did not report evaluation results for automatic measurements of spine parameters in postoperative images at all,^17,18 or the results were pooled with preoperative images,^10,13,15 potentially due to the challenges posed by implants. The presence of implants in postoperative images complicates both the automatic segmentation of individual vertebrae³⁰ and the placement of landmarks. In comparable validation studies for parameters measured in full-body³² and lumbar spine X-ray images,³³ the performance on postoperative images was typically worse than for preoperative images. However, in the current study, we were able to mitigate the performance loss for postoperative images observed during initial development of the algorithm. This was achieved by adding images with a large variety of implant types to the training set and by applying general data augmentation methods such as random rotation. After this optimization procedure, the algorithm works comparably well on preoperative and postoperative cervical spine X-rays.

Limitations and Outlook

In the present study, only degenerative cases that were surgically treated via an anterior approach were included. It is important to highlight that more complex implants and surgical techniques, eg, vertebral body replacements, long posterior instrumentations, or combined anterior-posterior surgical approaches, were not evaluated in the present patient cohort. Cases with trauma, destruction by tumor or infection, or coronal deformities were not analyzed. Furthermore, the validation images included in this monocentric study originate from a single radiological department and the sample size was limited to 129 patients. Including images acquired with a different setup (eg, different X-ray machines, positioning of the patient during the X-ray) may influence image quality and therefore lead to different results, but would also allow a larger sample size. Portable cross-table lateral X-rays should also be evaluated. Therefore, future studies should validate the algorithm on more diverse patient cohorts from other clinical sites.

Furthermore, T1 slope can only be determined if the upper endplate of the first thoracic vertebra is visible and not obscured by the shoulders. The automatic detection of T1 is therefore challenging, as reflected by the comparatively low detection rate. Currently, T1 slopes could only be measured in approximately half of the validation images (48% of preoperative and 51% of postoperative images). In contrast, human raters were able to approximate the T1 slope in all validation images by changing contrast and brightness. Korez et al.¹³ reported a similar problem regarding the occlusion of C7 by the shoulder, motivating the use of an imputation algorithm to determine C7 slope.
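
The dependence of T1 slope on a visible upper endplate can be made concrete with a minimal sketch. It assumes that the slope is taken as the unsigned angle between the horizontal and the line through two landmarks on the endplate, and that an image without a detected endplate is recorded as `None`. The function names and the example values are hypothetical and do not reproduce the study's pipeline.

```python
import math


def endplate_slope_deg(posterior, anterior):
    """Unsigned angle in degrees between an endplate line and the horizontal.

    posterior and anterior are (x, y) pixel coordinates of the two endplate
    corners. The sign convention is deliberately ignored in this sketch.
    """
    dx = anterior[0] - posterior[0]
    dy = anterior[1] - posterior[1]
    return math.degrees(math.atan2(abs(dy), abs(dx)))


def detection_rate(measurements):
    """Fraction of images in which a measurement could be obtained.

    None marks an image where the endplate was not detected.
    """
    if not measurements:
        raise ValueError("no images given")
    found = sum(1 for m in measurements if m is not None)
    return found / len(measurements)


# Made-up example: the endplate is detected in 2 of 4 images.
slopes = [
    endplate_slope_deg((100.0, 200.0), (140.0, 215.0)),
    None,
    endplate_slope_deg((90.0, 180.0), (130.0, 202.0)),
    None,
]
print(detection_rate(slopes))
```

Reporting the detection rate separately from the agreement statistics, as done in this study, keeps the two failure modes apart: an undetected endplate yields no measurement at all, whereas a detected endplate yields a measurement whose error can be compared against the human raters.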

Apart from improving the detection rate of the T1 upper endplate, the algorithm could be extended to measure even more parameters of the cervical spine, eg, segmental angles or disk angles of the operated segments.

Potential Use Cases of the Algorithm

Based on the results, the presented algorithm can be integrated into clinical routine as a valid and reliable tool for preoperative planning and postoperative evaluation. The algorithm may be used for a first analysis, allowing the physician to adjust and correct the automatic measurements, if required. This could lead to time and cost savings and improved treatment planning in patients with cervical degeneration undergoing ACDF or CDA. The average time taken by the algorithm to analyze an image without any manual interference was 14 seconds. Considering that the medical raters typically required 3-5 minutes to analyze an image with all required cervical balance parameters, the algorithm was considerably more time-efficient.
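
The reported timings can be put into proportion with a short calculation. The sketch below uses the 14 seconds per image and the 3-5 minutes per image stated above. The figure of 258 images (129 patients with one preoperative and one postoperative image each) is an assumption made for illustration, and the function names are hypothetical.

```python
def speedup_range(auto_seconds, manual_minutes_low, manual_minutes_high):
    """Ratio of manual to automatic analysis time per image."""
    low = manual_minutes_low * 60 / auto_seconds
    high = manual_minutes_high * 60 / auto_seconds
    return low, high


def batch_minutes(n_images, seconds_per_image):
    """Total analysis time in minutes for a batch of images."""
    return n_images * seconds_per_image / 60


low, high = speedup_range(14, 3, 5)
print(f"per image: {low:.1f}x to {high:.1f}x faster")

# Assumption: 129 patients x 2 images (pre- and postoperative) = 258 images.
n_images = 258
print(f"automatic: {batch_minutes(n_images, 14):.1f} min")
print(f"manual: {batch_minutes(n_images, 180) / 60:.1f} to "
      f"{batch_minutes(n_images, 300) / 60:.1f} h")
```

Under these assumptions, the automatic analysis is roughly 13 to 21 times faster per image, and a cohort of this size would take about one hour of unattended computation compared with roughly 13 to 21.5 hours of manual measurement for a single pass by one rater.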

Furthermore, the algorithm could autonomously analyze large datasets of X-ray images for research purposes. Given the high reliability and accuracy, the algorithm could be used without supervision of each case. The chance of measurement errors may be considered acceptable if the purpose of a study is to determine whether, eg, a difference exists between groups receiving different treatments, provided the sample size is large. Extreme outliers could also be identified by a quick visual inspection of the proving image produced by the algorithm. The ability to analyze large-scale datasets could also help establish estimates of parameter-specific standard or average values (or ranges of values) in the healthy population or in different age or disease groups.

Conclusion

This study proposes a novel algorithm based on state-of-the-art AI methods for the fully automated measurement of cervical sagittal profile parameters in pre- and postoperative cervical spine radiographs of patients suffering from degenerative spinal diseases. The algorithm was validated by comparing automatic measurements against manual measurements from 2 experienced physicians, demonstrating excellent reliability and accuracy and thereby confirming the initial hypothesis. This may allow its use in various scenarios, eg, to support individual measurements conducted during clinical routine or to autonomously analyze large-scale datasets for research purposes. To further validate the algorithm on a substantially larger patient cohort, additional investigations will be conducted as part of a future multicenter study.

Footnotes

Acknowledgments

The authors would like to thank Preston Melchert for his excellent editing and proofreading of the manuscript.

Declaration of Conflicting Interests

The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: This research was a cooperation with Raylytic as part of the AIQNET initiative and was supported by the German Federal Ministry for Economic Affairs and Climate Action as a DLR project. This does not in any way affect the scientific output of this study.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the German Federal Ministry for Economic Affairs and Climate Action (https://www.bmwi.de) as a DLR project with ID 01MK20003A. It is part of the AIQNET initiative (https://aiqnet.eu). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. JM is a former employee and PG, MD, and CS are current employees at RAYLYTIC GmbH.

ORCID iDs

Sophia Vogt

Carolin Scholl

Ulf-Dietrich Braumann

Appendix

References

1. Hoy, Protani, Buchbinder. The epidemiology of neck pain. Best Pract Res Clin Rheumatol. 2010;24(6):783-792.

2. Cohen. Epidemiology, diagnosis, and treatment of neck pain. Mayo Clin Proc. 2015;90(2):284-299.

3. Ling, Chevillotte, Leglise, Thompson, Bouthors, Le Huec. Which parameters are relevant in sagittal balance analysis of the cervical spine? A literature review. Eur Spine J. 2018;27:8-15.

4. Lee, Hyun, Jain. Cervical sagittal alignment: literature review and future directions. Neurospine. 2020;17(3):478-496.

5. Suda, Abumi, Ito, Shono, Kaneda, Fujiya. Local kyphosis reduces surgical outcomes of expansive open-door laminoplasty for cervical spondylotic myelopathy. Spine. 2003;28(12):1258-1262.

6. Uchida, Nakajima, Sato, et al. Cervical spondylotic myelopathy associated with kyphosis or sagittal sigmoid alignment: outcome after anterior or posterior decompression. J Neurosurg Spine. 2009;11(5):521-528.

7. Park, Kelly, Lee, Min, Rahman, Riew. Sagittal alignment as a predictor of clinical adjacent segment pathology requiring surgery after anterior cervical arthrodesis. Spine J. 2014;14(7):1228-1234.

8. Tang, Scheer, Smith, et al. The impact of standing regional cervical sagittal alignment on outcomes in posterior cervical fusion surgery. Neurosurgery. 2012;71(3):662-669; discussion 669.

9. Papavero, Schmeiser, Kothe, et al. Degenerative cervical myelopathy: a 7-letter coding system that supports decision-making for the surgical approach. Neurospine. 2020;17(1):164-171.

10. Fujimori, Suzuki, Takenaka, et al. Development of artificial intelligence for automated measurement of cervical lordosis on lateral radiographs. Sci Rep. 2022;12(1):15732.

11. Chen, Zhai, Sun, Wang, Yang. A narrative review of machine learning as promising revolution in clinical practice of scoliosis. Ann Transl Med. 2021;9(1):67.

12. Galbusera, Casaroli, Bassani. Artificial intelligence and machine learning in spine research. JOR Spine. 2019;2(1):e1044.

13. Korez, Putzier, Vrtovec. A deep learning tool for fully automated measurements of sagittal spinopelvic balance from X-ray images: performance evaluation. Eur Spine J. 2020;29(9):2295-2305.

14. Bailey, Rasoulinejad. Automated comprehensive adolescent idiopathic scoliosis assessment using MVC-net. Med Image Anal. 2018;48:1-11.

15. Yeh, Weng, Huang, Tsai, Yeh. Deep learning approach for automatic landmark detection and alignment analysis in whole-spine lateral radiographs. Sci Rep. 2021;11(1):7618.

16. Zhang. Computer-aided Cobb measurement based on automatic detection of vertebral slopes using deep neural network. Int J Biomed Imag. 2017;2017:9083916.

17. Galbusera, Bassani, Costa, Brayda-Bruno, Zerbi, Wilke H-J. Artificial neural networks for the recognition of vertebral landmarks in the lumbar spine. Comput Methods Biomech Biomed Eng: Imaging & Visualization. 2018;6(4):447-452.

18. Cho, Kaji, Cheung, et al. Automated measurement of lumbar lordosis on radiographs using machine learning and computer vision. Global Spine J. 2020;10(5):611-618.

19. Hyun, Kim, Jahng, Kim. Relationship between T1 slope and cervical alignment following multilevel posterior cervical fusion surgery: impact of T1 slope minus cervical lordosis. Spine. 2016;41(7):E396-402.

20. Gkioxari, Dollár, Girshick. Proceedings of the IEEE International Conference on Computer Vision. 2017.

21. Paszke, Gross, Massa, et al. PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst. 2019;32.

22. Ronneberger, Fischer, Brox. U-net: convolutional networks for biomedical image segmentation. In: Paper presented at: International Conference on Medical image computing and computer-assisted intervention. 2015.

23. Abadi, Agarwal, Barham, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv preprint arXiv:1603.04467. 2016.

24. Koo. Cracking the code: providing insight into the fundamentals of research and evidence-based practice a guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15(2):155-163.

25. McGraw, Wong. Forming inferences about some intraclass correlation coefficients. Psychol Methods. 1996;1(1):30.

26. Shrout, Fleiss. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86(2):420-428.

27. Cicchetti. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol Assess. 1994;6(4):284.

28. Van Rossum, Drake. Python 3 Reference Manual. Scotts Valley, CA: CreateSpace; 2009.

29. Amasya, Cesur, Yildirim, Orhan. Validation of cervical vertebral maturation stages: artificial intelligence vs human observer visual analysis. Am J Orthod Dentofacial Orthop. 2020;158(6):e173-e179.

30. Al Arif, Knapp, Slabaugh. Fully automatic cervical vertebrae segmentation framework for X-ray images. Comput Methods Progr Biomed. 2018;157:95-111.

31. Lecron, Benjelloun, Mahmoudi. Cervical spine mobility analysis on radiographs: a fully automatic approach. Comput Med Imag Graph. 2012;36(8):634-642.

32. Grover, Siebenwirth, Caspari, et al. Can artificial intelligence support or even replace physicians in measuring sagittal balance? A validation study on preoperative and postoperative full spine images of 170 patients. Eur Spine J. 2022:1-9.

33. Orosz, Haines, Thomson, et al. Novel artificial intelligence algorithm can accurately and independently measure spinopelvic parameters. Spine J. 2021;21(9):S36-S37. 74.

Novel AI-Based Algorithm for the Automated Measurement of Cervical Sagittal Balance Parameters. A Validation Study on Pre- and Postoperative Radiographs of 129 Patients
