Abstract
Objective
Artificial intelligence offers opportunities for timesaving assessments of multiple pathologies in large magnetic resonance imaging (MRI) data sets in knee osteoarthritis (KOA). This study evaluated their prevalence within pre-defined clinical phenotypes and their predictive value for knee replacement (KR).
Design
Baseline MRIs (n = 8,667) from the Osteoarthritis Initiative were analyzed using a machine-learning (ML) algorithm. The presence of pathologies (menisci, anterior cruciate, medial collateral ligaments, cartilage, etc.) was assessed in previously identified phenotypic clusters (a post-traumatic, metabolic, and age-defined phenotype). The value of both cluster allocation and joint pathology for KR prediction was evaluated using supervised ML models and time-dependent receiver operating characteristic curves.
Results
Compared to the population average, the metabolic cluster had a higher prevalence of cartilage lesions, while the post-traumatic one had more medial meniscal damage. Random forest models showed the best prediction (area under the curve 0.837, test set at 2 years). The top predictors for KR were meniscal position (relative to the border of the tibial plateau), severe joint effusion, medial femorotibial cartilage lesions, and metabolic phenotype. These features defined patients at high risk of KR with an estimated KR rate at 5 years of 10% vs 3% in the high- and low-risk groups based on a predictive risk score including all analyzed structures.
Conclusions
This ML-enabled assessment of multiple MRI pathologies in a large KOA data set highlights the importance of meniscal pathologies and markers of inflammation, in addition to cartilage assessments and clinical information for patient stratification and improved prediction of KOA progression to KR.
Introduction
Osteoarthritis (OA), especially of the knee (KOA), is considered a serious disease with high unmet medical need 1 based on the definition of “seriousness” in the Code of Federal Regulations. 2 Despite this, and despite decades of OA research and advances in understanding the underlying pathomechanisms, no disease-modifying treatment has yet been licensed. 3
Trials in KOA drug development have primarily focused on cartilage loss as key structure for inclusion and outcome assessment, using indirect measurements like joint space width on conventional x-rays or cartilage morphology on magnetic resonance imaging (MRI). Both imaging strategies are associated with challenges, including the susceptibility of radiographic joint space measures to positioning, confounding from meniscal position, and disregarding other structural pathologies in a whole-joint disease.4-6 The role of these other structures in the progression of KOA is well documented.7,8 Findings from MRI confirmed by histology have highlighted inflammatory changes, for example, in the knee and surrounding tissues 9 even at early stages.10,11
It is widely accepted that there are different KOA phenotypes,12,13 potentially associated with underlying endotypes. The phenotypes identified depend on the input variables, which explains the differences reported in the literature. They include an inflammatory phenotype, based on the local expression of inflammatory markers and the clinical/imaging findings of inflammation.14,15 The observation of circulating inflammatory markers may also point to a metabolic phenotype characterized by comorbidities similarly observed in cardiovascular risk profiles.16,17 Patients suffering from high levels of pain, signs of central sensitization but relatively limited structural changes have been associated with a pain-phenotype, 18 while some patients experience very few symptoms.15,19 Finally, biomechanical and post-traumatic phenotypes are described.20,21 The links between phenotypes and endotypes remain unclear, but biomarker panels have suggested potential underlying mechanisms.22-24
We have previously identified different clinical phenotypes in the Osteoarthritis Initiative (OAI) database. Focusing on patients’ clinical baseline characteristics to facilitate the use of this phenotyping approach in practice, we primarily identified 3 different phenotypes, one potentially post-traumatic/biomechanical, one metabolic/inflammatory, and one with limited clinical symptoms (see “Methods” section for further details) 15 in line with other reports. 25 These clusters showed different trajectories of disease progression toward knee replacement (KR) over time, with the metabolic/inflammatory cluster exhibiting the highest risk. The phenotypes also showed differences in a single quantitative imaging biomarker, baseline bone shape (measured as B-score), linking clinical and structural changes in the context of KOA. Higher B-scores (reflecting 3D bone shape including femoral flattening and osteophytes) have been associated with disease progression and later joint replacement. 26 In our analysis, B-score added value (independent of improving cluster allocation) for the prediction of KR compared to clinical phenotypes alone. 27 Given this improvement of prediction from one structural biomarker, it would be relevant to consider the impact of multiple joint structural pathologies when evaluating populations with KOA, especially in the context of targeted drug development.
A recent review has evaluated the current literature on machine-learning (ML)-based prediction of OA progression. Despite increasing knowledge, the review identified certain gaps in current approaches relating to the non-standardized definition of progression and a trend to neglect complex data such as MRI data, accelerometry, or biomarkers.28-30
The increasing implementation of ML-based approaches in image analysis offers an opportunity to address these gaps. Algorithms have been developed to facilitate systematic analyses of multiple knee pathologies in large data sets.31,32 The current study therefore utilized a proprietary algorithm (KEROS V2.0.0; Incepto Medical—CE marking certification granted 09/2024)31,32 that was developed primarily as a support tool for clinical image-analysis reporting. We aimed to investigate the relationship between the previously identified clinical phenotypes and the presence of ML-detected MRI structural abnormalities in the knee and evaluate their predictive value for KR. We first hypothesized that specific pathologies may be more prevalent in specific phenotypes. The addition of detailed imaging information could thereby contribute to a more precise delineation of phenotypes. In addition, we hypothesized that combining imaging information and clinical phenotypes could improve the prediction of total or partial KR compared to a prediction based on clinical phenotypes alone. Improving the prediction of KR would be of great value, since it could support patients and providers in discussing and making informed treatment choices. Such predictions could also inform on the need for health care resources and facilitate the evaluation of benefits from non-KR approaches.
Methods
Clinical and Imaging Data
The study used data from the OAI, a multi-center, longitudinal, prospective observational cohort study of KOA including 4,796 participants. The prospective data collection into the OAI was approved by the institutional review boards of the participating centers. 33 All patients gave informed consent to the data collection and secondary use. This analysis was approved by Ethikkommission Nordwest-und Zentralschweiz (Basec-No. 2023-01249).
Previous Clustering Analyses
In previous work, we proposed 2 clustering approaches using deep embedded clustering (DEC) and multiple factor analysis and clustering (MFAC) in the OAI database. The analysis identified distinct phenotypes of patients suffering from KOA based on 157 clinical baseline variables (see supplement for further details). 15 The DEC model used an auto-encoder for dimensionality reduction and a clustering layer for cluster identification. The MFAC used a weighted principal component and hierarchical clustering on the principal component for the cluster identification.
Both approaches depicted similar clusters:
- a cluster slightly younger than the average, with high levels of activity and low impact from pain (DEC [D1] and MFAC [M1]);
- a second characterized by a high burden of comorbidities, pain, and disability (D3/M3);
- a third cluster older than the average, comparatively inactive and less afflicted by pain (D5/M2).
The DEC approach identified 2 additional clusters: D4, comparable to D1 but less active, and D2, similar to D3 but presenting an exceptionally high rate of effusion. 15
On further analysis, these clusters demonstrated an association with differences in baseline B-score and trajectory toward KR with increased risk for patients in the “comorbid” clusters D3/M3 and for D2 with effusion. 27 Given the similarity between clusters D2/3 and M3, as well as D1/4 and M1, we focused on 3 groupings: a likely post-traumatic cluster (D1/M1), a comorbid cluster, potentially reflecting a metabolic phenotype (D2,3/M3) and an age-related cluster (D5/M2).
Image Analysis Algorithms
The software algorithms were developed using a large (n > 20 k) radiologist-annotated data set of knee MRI series (collected from 2009 to 2020) from 12 imaging centers to provide a categorical characterization of a range of structures/pathologies ( Table 1 , see Supplemental Table S1 for summary statistics of the underlying data set). Each specific pathology analysis pipeline was trained separately and (except for patellar height and trochlear depth measurements) relied on deep learning models that broadly consisted of 2 parts:
- A set of convolutional neural networks (CNNs) to locate the target joint structure.
- A subsequent set of CNNs to classify (from the previously identified location of the joint structure and based on the radiologists’ annotation) the target structure as “normal, doubtful, or abnormal.”
Overview of Evaluated Joint Structures Using KEROS V2.0.0 (Incepto Medical—CE Marking Certification Granted 09/2024).
Binary outcomes are normal and abnormal. The system can flag potentially abnormal results separately; as a conservative analytic approach, and to simplify the evaluation, these were evaluated as abnormal in this article.
Complex pathology includes abnormalities such as bucket handle or parrot-type lesions.
Position refers to the relative position of the outer meniscal border to the osseous border of the respective tibial plateau.
The labels normal and abnormal were learned from image readings by radiologists from 12 centers on 20 k data sets, as described in the “Methods” section of this article.
This ternary classification was a compromise between clinical utility, data constraints, and the need for reproducible results across a broad range of users. The algorithm was developed as a commercially available product providing diagnostic support in clinical practice. Accordingly, the classifiers “normal, doubtful, and abnormal” in KEROS V2.0.0 were designed to alert users in the clinical/diagnostic context to potential abnormalities, providing a high-sensitivity alert allowing clinicians to maintain control over the image interpretation in the clinical context and follow-up actions.
By condensing the assessment of structural severity to a ternary outcome label, the algorithm further aims at increasing reliability and robustness in detecting the presence of abnormalities, rather than forcing a distinction between subtle gradations; this is especially relevant, since during development, the distribution of cases across more granular levels of pathology was often significantly skewed in the underlying data set, which limited the ability to train a model with strong performance across multiple finer-grained categories.
Since the algorithms were trained on MRI series annotated by experienced musculoskeletal radiologists, we assumed the category “doubtful” to reflect a potential abnormality. The category “doubtful” was therefore included as abnormal in this analysis to simplify the evaluation.
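The collapse of the ternary label into a binary input can be sketched in a few lines of Python. This is an illustrative sketch only: the label strings and the function name `to_binary` are assumptions, since the output encoding of KEROS V2.0.0 is not described here, and the analyses themselves were run in R.

```python
# Hypothetical label strings; the actual output encoding of the algorithm
# is not given in the text.
ABNORMAL_LABELS = {"doubtful", "abnormal"}


def to_binary(label: str) -> int:
    """Map a ternary reading to 0 (normal) or 1 (abnormal).

    "doubtful" is counted as abnormal, mirroring the conservative
    approach described above.
    """
    cleaned = label.strip().lower()
    if cleaned == "normal":
        return 0
    if cleaned in ABNORMAL_LABELS:
        return 1
    raise ValueError(f"unexpected label: {label!r}")
```

Raising an error on unknown labels, rather than defaulting to normal, keeps encoding problems visible instead of silently lowering the prevalence of abnormal findings.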
The algorithms for KEROS V2.0.0 were validated using both proprietary data sets (collected by Incepto Medical) and publicly available data sets.31,32 The validation was based on standalone performance metrics of the algorithms including sensitivity, specificity, and area under the curve (AUC) compared to expert reading as the gold standard. The classification thresholds were calculated using a combination of empirical model calibration on a validation data set (with overlapping expert diagnoses) and clinical guidelines for identifying pathological features on MRI.
Clustering Approach
For this analysis, data from the incident and progression cohorts of the OAI data set 33 were utilized following previously described data analytic approaches using DEC and MFAC. 15 In addition to the baseline variables mentioned above, variables pertaining to the most common knee pathologies analyzed from MRI data (detailed in Table 1 , sagittal T2/PD-FATSAT acquisition) were included.
Outcome—Joint Replacement
Total KR (V99ELKDAYS, V99ERKDAYS in the OAI data set) or partial medial or lateral KR (V99ELKTLPR, V99ERKTLPR in the OAI data set) were employed as the outcome.
The time to the first KR event was defined as the time from the enrollment date to the first incidence of KR (in either knee). In the absence of an event during follow-up, the censoring date applied was the earliest of the following: date of death, date of withdrawal of informed consent, or date of last contact.
For patients having a unilateral KR, the baseline joint pathologies corresponding to that same side were selected. If the patient had no KR event or a KR event simultaneously on both sides, images from the knee with worse joint pathology at baseline were selected. This approach was selected to avoid potential collinearity in the regression models while maximizing the number of observable KRs as outcome events. The model aimed at a prediction of KR at the subject level and not at disease progression at the joint level.
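The outcome definition and the choice of the index knee can be illustrated with a short Python sketch. All function and argument names are hypothetical. Times are taken as days from enrollment, the per-knee "severity" is an assumed summary (for example, a count of abnormal structures), and two details that the text leaves open are filled with explicit assumptions: sequential bilateral KR uses the side of the first event, and ties in severity default to the left knee.

```python
from typing import Optional, Tuple


def time_to_first_kr(left_kr_day: Optional[int],
                     right_kr_day: Optional[int],
                     censor_day: int) -> Tuple[int, int, Optional[str]]:
    """Return (time, event, side) for one participant.

    event is 1 if any KR occurred, else 0 (censored at censor_day).
    side is None when there was no KR or both knees were replaced on
    the same day.
    """
    observed = [(day, side)
                for day, side in ((left_kr_day, "left"), (right_kr_day, "right"))
                if day is not None]
    if not observed:
        return censor_day, 0, None
    first_day = min(day for day, _ in observed)
    sides = [side for day, side in observed if day == first_day]
    # Assumption: with sequential bilateral KR, the first side is used.
    return first_day, 1, sides[0] if len(sides) == 1 else None


def select_index_knee(kr_side: Optional[str],
                      left_severity: float,
                      right_severity: float) -> str:
    """Pick the knee whose baseline pathologies enter the model."""
    if kr_side is not None:
        return kr_side
    # Assumption: ties in the hypothetical severity score go to the left knee.
    return "left" if left_severity >= right_severity else "right"
```

For example, `time_to_first_kr(800, None, 2900)` yields an event at day 800 on the left side, so the left knee's baseline findings would be used regardless of which knee looked worse at baseline.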
Statistical Analysis
The statistical approach has been described previously. 27 Briefly, data were summarized using descriptive statistics (quantitative data) and contingency tables (qualitative data). Categorical data were presented as frequencies and percentages. For continuous data, mean (along with 95% CI), standard deviation, median, 25th and 75th percentiles, minimum, and maximum were computed.
Time to event (first KR) was presented descriptively using the Kaplan–Meier curves and summarized as the proportion of patients who were event-free at different time points (2, 5, and 8 years) along with the corresponding 95% CI.
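For readers less familiar with the estimator, the product-limit (Kaplan–Meier) calculation behind these event-free proportions can be sketched in plain Python. This is a minimal illustration with hypothetical function names; it omits the 95% CI and is not the R code used for the analysis.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time.

    times  : follow-up times (any consistent unit)
    events : 1 if KR occurred at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving
    return curve


def event_free_at(curve, t):
    """Proportion event-free at time t (step function, right-continuous)."""
    proportion = 1.0
    for time, survival in curve:
        if time > t:
            break
        proportion = survival
    return proportion
```

With follow-up in years, `event_free_at(curve, 2)`, `event_free_at(curve, 5)`, and `event_free_at(curve, 8)` would give the proportions reported at the three time points.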
Model Development and Evaluation
The outcome of the supervised predictive models was the deviance residual from a null (intercept-only) Cox model for time to first KR. The population was divided into a training set (80%) and a test set (20%) by random sampling, stratified by deviance residuals ( Fig. 1 ).
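To make this regression target concrete, the sketch below computes deviance residuals for a Cox model without covariates. It assumes a Breslow-type (Nelson–Aalen) estimate of the cumulative hazard, H(t), with the martingale residual M = δ − H(t) and the deviance residual sign(M)·sqrt(−2[M + δ·log(δ − M)]); the tie handling and the function name are assumptions, and the original analysis may differ in these details.

```python
import math


def null_cox_deviance_residuals(times, events):
    """Deviance residuals of an intercept-only Cox model.

    times  : follow-up times
    events : 1 for an observed KR, 0 for censoring
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    n_at_risk = len(times)
    cumulative_hazard = 0.0
    hazard_at = {}
    i = 0
    while i < len(order):
        t = times[order[i]]
        tied = []
        while i < len(order) and times[order[i]] == t:
            tied.append(order[i])
            i += 1
        n_events = sum(events[j] for j in tied)
        # Nelson-Aalen increment: events divided by number at risk.
        cumulative_hazard += n_events / n_at_risk
        hazard_at[t] = cumulative_hazard
        n_at_risk -= len(tied)

    residuals = []
    for t, event in zip(times, events):
        martingale = event - hazard_at[t]
        log_term = math.log(event - martingale) if event else 0.0
        inner = martingale + event * log_term
        # max() guards against tiny positive values from rounding.
        magnitude = math.sqrt(max(-2.0 * inner, 0.0))
        residuals.append(math.copysign(magnitude, martingale))
    return residuals
```

Subjects with an early event receive large positive residuals and long event-free follow-up gives negative ones, which is what allows a standard regression learner to be trained on a censored outcome.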

Flowchart of model development.
Continuous variables were standardized to have a mean of zero and a standard deviation of 1 using the training data set; categorical variables were dummy transformed.
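The point of estimating the scaling parameters on the training data only is that no information from the test set leaks into preprocessing. A minimal Python sketch follows; the function names are hypothetical, and dropping the first level as the reference category is an assumption, since the text only states that categorical variables were dummy transformed.

```python
from statistics import mean, stdev


def fit_scaler(train_values):
    """Estimate mean and standard deviation on the training set only."""
    return mean(train_values), stdev(train_values)


def standardize(values, scaler):
    """Apply training-set parameters to any data set (training or test)."""
    mu, sd = scaler
    return [(v - mu) / sd for v in values]


def dummy_encode(values, levels):
    """One indicator column per level, dropping the first as reference.

    Dropping a reference level is an assumption made for this sketch.
    """
    return [[1 if v == level else 0 for level in levels[1:]] for v in values]
```

For example, with clusters `["M1", "M2", "M3"]`, a participant in M3 is encoded as `[0, 1]` and one in M1 (the reference) as `[0, 0]`.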
Input baseline variables (clinical phenotypes and observed joint pathologies) were used to produce supervised predictive models for the deviance residuals using robust statistical model-based approaches, which included an elastic net (ENET), a random forest (RF), extreme gradient boosting (XGBOOST), and a multilayer perceptron (MLP). The RF and XGBOOST (both based on decision trees) and the MLP (a type of neural network) were used due to their ability to learn and model nonlinear and complex relationships, whereas the ENET, which provides a balance between ridge and lasso regression, was selected for its capacity to handle multicollinearity. For each model, a cross-validation (CV) procedure was used to estimate prediction performance while also optimizing model hyperparameters using Bayesian optimization or simulated annealing methods. The hyperparameters were optimized in the inner CV loop (5-fold CV repeated 5 times) using the training set of the outer CV loop.
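The nested structure of this procedure can be sketched as follows. This is a simplified illustration, not the authors' pipeline: an exhaustive search over a small candidate list stands in for Bayesian optimization or simulated annealing, the number of outer folds (5) is an assumption, and `fit_score` is a hypothetical user-supplied callable that fits a model on one index set and returns the RMSE on another.

```python
import random


def kfold_indices(items, k, seed):
    """Shuffle items and split them into k roughly equal folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv(n, candidates, fit_score, k_outer=5, k_inner=5, repeats=5, seed=0):
    """Return [(best_params, outer_rmse)] for each outer fold.

    fit_score(train_idx, eval_idx, params) -> RMSE (hypothetical callable).
    """
    results = []
    for test_idx in kfold_indices(range(n), k_outer, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]

        def inner_rmse(params):
            # Repeated k-fold CV restricted to the outer training set.
            scores = []
            for r in range(repeats):
                for val_idx in kfold_indices(train_idx, k_inner, seed + 1 + r):
                    validation = set(val_idx)
                    fit_idx = [i for i in train_idx if i not in validation]
                    scores.append(fit_score(fit_idx, val_idx, params))
            return sum(scores) / len(scores)

        # Stand-in for Bayesian optimization / simulated annealing.
        best = min(candidates, key=inner_rmse)
        results.append((best, fit_score(train_idx, test_idx, best)))
    return results
```

Because the outer test fold never enters the inner loop, the outer RMSE is not biased by the hyperparameter search.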
Global performance of the supervised models was determined by root mean square error (RMSE), time-dependent area under the receiver operating characteristic (ROC) curve, and the C-index as a discrimination metric; these measures were used to select the best predictive model. To estimate the 95% CI of the C-index for each model, 1,000 resampled iterations on the training and test data were performed.
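As an illustration of the discrimination metric, the sketch below implements Harrell's C-index and a percentile bootstrap for its confidence interval. It is a plain-Python sketch under stated assumptions: pairs with tied event times are ignored, tied risk scores count as half-concordant, and the percentile method is assumed for the resampling-based interval, which the text does not specify.

```python
import random


def harrell_c_index(times, events, risk):
    """Share of comparable pairs in which the higher risk score
    belongs to the subject with the earlier event."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is anchored on an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def bootstrap_ci(times, events, risk, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed method) for the C-index."""
    rng = random.Random(seed)
    n = len(times)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(harrell_c_index([times[i] for i in idx],
                                         [events[i] for i in idx],
                                         [risk[i] for i in idx]))
        except ValueError:
            continue  # skip resamples without comparable pairs
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(int((1 - alpha / 2) * len(stats)), len(stats) - 1)]
    return lower, upper
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking of participants by time to KR.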
Key features in the best-performing model were identified using variable importance, assessed as the impact on RMSE across 100 permutations. SHapley Additive exPlanations (SHAP) values were used to explain and compare the outputs of the ML models.
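Permutation-based variable importance can be sketched as follows: each feature column is shuffled in turn, and the mean increase in RMSE over the permutations is taken as that feature's importance. The function names and the `predict` callable are hypothetical; this is not the implementation used in the study.

```python
import random


def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    return (sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def permutation_importance(predict, X, y, n_perm=100, seed=0):
    """Mean RMSE increase per feature after shuffling that feature.

    predict : callable taking a list of rows and returning predictions
    X       : list of rows, each a list of feature values
    """
    rng = random.Random(seed)
    baseline = rmse(y, predict(X))
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_perm):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Replace only the column under study; keep the others intact.
            X_perm = [row[:col] + [value] + row[col + 1:]
                      for row, value in zip(X, shuffled)]
            increases.append(rmse(y, predict(X_perm)) - baseline)
        importances.append(sum(increases) / n_perm)
    return importances
```

A feature the model ignores yields an importance of zero, while shuffling a feature the model relies on breaks its link to the outcome and raises the error.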
The top individual features identified by the best-performing model were used to assess their predictive value using a univariable Cox proportional hazard regression model.
The final model was used to derive a composite continuous risk score value. To facilitate clinical interpretation, the risk score was used to categorize the population into 2 subgroups based on the upper quartile of the predicted risk score of the final model in the training data; the same threshold was then applied to the test set (i.e., lower 75% defined as low-risk, upper 25% defined as high-risk group). Predictive modeling was applied to these 2 risk groups, again using the training and test set.
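The risk-group assignment amounts to learning one threshold on the training scores and reusing it unchanged on the test set. A minimal sketch with hypothetical function names is shown below; the interpolation rule for the 75th percentile (`method="inclusive"`) is an assumption.

```python
from statistics import quantiles


def fit_threshold(train_scores):
    """75th percentile of the training risk scores (upper-quartile cut)."""
    return quantiles(train_scores, n=4, method="inclusive")[2]


def assign_group(scores, threshold):
    """Label each score using the threshold learned on the training set."""
    return ["high" if s > threshold else "low" for s in scores]
```

Because the threshold is fixed before the test set is scored, the share of test participants labeled high-risk need not be exactly 25%.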
To identify statistically significant differences between clusters or risk groups at baseline, nonparametric Wilcoxon-Mann-Whitney or Kruskal-Wallis rank sum and chi-squared contingency table tests were performed for continuous and categorical variables, respectively. All P-values are nominal, and no multiplicity adjustments were performed.
Handling of Missing Data
Imputation of missing baseline information on the presence of MRI pathology was performed. A benchmark of missing imputation algorithms (random imputation, k-nearest neighbors [kNN], missing values with multivariate data analysis [missMDA], multiple imputation with denoising autoencoders [MIDAS], RF, and multiple imputation by chained equations [MICE]) was performed under different missing pattern assumptions: missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR). A total of 100 simulations were performed, and the best imputation algorithm corresponding to the lowest RMSE was selected to impute the missing values.
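The logic of such a benchmark is to hide values that are actually known, impute them, and compare the imputed with the true values. The sketch below illustrates this for the MCAR pattern only, with two deliberately simple imputers (mean and random draw) standing in for the algorithms listed above; all names are hypothetical and the data are treated as numeric for simplicity.

```python
import random
from statistics import mean


def mask_mcar(values, fraction, rng):
    """Hide a random fraction of values (missing completely at random)."""
    n_missing = max(1, int(fraction * len(values)))
    missing = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in missing else v for i, v in enumerate(values)]
    return masked, missing


def impute_mean(masked, rng):
    """Fill gaps with the mean of the observed values (rng unused)."""
    fill = mean(v for v in masked if v is not None)
    return [fill if v is None else v for v in masked]


def impute_random(masked, rng):
    """Fill gaps with a random draw from the observed values."""
    observed = [v for v in masked if v is not None]
    return [rng.choice(observed) if v is None else v for v in masked]


def benchmark(values, imputers, fraction=0.2, n_sim=100, seed=0):
    """Mean RMSE per imputer over n_sim simulated missingness patterns."""
    rng = random.Random(seed)
    errors = {name: [] for name in imputers}
    for _ in range(n_sim):
        masked, missing = mask_mcar(values, fraction, rng)
        for name, impute in imputers.items():
            filled = impute(masked, rng)
            squared = [(filled[i] - values[i]) ** 2 for i in missing]
            errors[name].append((sum(squared) / len(squared)) ** 0.5)
    return {name: mean(errs) for name, errs in errors.items()}
```

The imputer with the lowest mean RMSE, `min(result, key=result.get)`, would then be used for the real missing values; MAR and MNAR patterns would require masking functions in which the probability of being hidden depends on observed or unobserved values.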
Software
All statistical computations were performed in R version 4.1.0 (2021-05-18; R Core Team, 2021) using the RStudio environment, version 2022.07.3+585.pro1 (RStudio Team, 2021).
Results
Prevalence of Magnetic Resonance Imaging-Detected Pathologies
Tables 2
Summary of Joint Pathologies at Baseline Across MFAC Clusters.
M1 refers to the first cluster identified using MFAC characterized by younger average age, high levels of activity, and relatively low pain levels.
M2 refers to the second cluster identified using MFAC characterized by an older average age, low levels of activity, and relatively low pain levels.
M3 refers to the third cluster identified using MFAC characterized by a high burden of comorbidity, depression, pain, and disability.
3 mm is an established cut-off for medial meniscal extrusion; there is none for the lateral meniscus. Given the distribution of values, we set the cut-off at −1 mm, i.e., centralization of the meniscus.
Kruskal-Wallis rank sum test.
Pearson’s chi-squared test.
Summary of Joint Pathologies at Baseline Across DEC Clusters.
D1 refers to the first cluster identified using DEC characterized by younger average age, predominance of the male sex, high levels of activity, and relatively low pain levels.
D2 refers to the second cluster identified using DEC characterized by a high burden of comorbidity, pain, and presence of effusion.
D3 refers to the third cluster identified using DEC characterized by a high burden of comorbidity, depression, pain, and disability.
D4 refers to the fourth cluster identified using DEC similar in characteristics to D1 but with lower levels of activity.
D5 refers to the fifth cluster identified using DEC characterized by an older average age, low levels of activity, and relatively low pain levels.
3 mm is an established cut-off for medial meniscal extrusion; there is none for the lateral meniscus. Given the distribution of values, we set the cut-off at −1 mm, i.e., centralization of the meniscus.
Kruskal-Wallis rank sum test.
Pearson’s chi-squared test.
Imputation, Model Tuning, and Performances
The results of the 100 imputation simulations under MCAR, MAR, and MNAR missing patterns are presented in Supplemental Table S2 and Supplemental Figure S2. The mean RMSE values for the missMDA and RF imputation algorithms were comparable and lower than those for kNN, MIDAS, MICE, or random imputation. All missing data were subsequently imputed with missMDA.
The hyperparameter search during the CV procedure is presented in Supplemental Figure S3.
The RMSE values for simulated annealing and Bayesian optimization were comparable, and the median RMSE for the RF and XGBOOST models was lower than for the MLP and ENET models. The best tuning parameters of each model were then used to predict the outcome for the training set and test set. The AUC over time in the test set and Harrell’s C-index were used to compare the performance of the ML models.
The RF, ENET, XGBOOST, and MLP models performed similarly when comparing the C-index with 95% CI. In addition, all ML models performed better than a model solely based on cluster allocation. Cluster allocation in this context refers to the use of an algorithm solely based on clinical information, in which dimension 1 is driven by variables associated with disease perception such as PRO information, while dimension 2 relates to the clinical picture with knee examination, physical activity, and anthropometrics. In the ML models, which combined cluster allocation with imaging information, lateral or medial meniscal position and effusion were the most relevant features ( Fig. 2 and Table 4 ). We observed a decrease in performance in the test set compared with the training set ( Fig. 3 and Table 4 ).

Comparison of the predictive performance across machine-learning models in the test set.
Discrimination Measures.

Kaplan-Meier analysis of joint replacement by predicted risk group.
The RF was chosen for further analyses based on its performance in the test set ( Fig. 2 ). At 2 years, the AUC for the RF model was 0.837 in the test set ( Table 4 and Supplemental Figure S4).
Baseline Characteristics Associated With Joint Replacement
The most impactful features for the prediction of KR events using the RF model (Supplemental Figures S5 and S6) were the meniscal position (lateral and medial), the presence of severe joint effusion, abnormalities of the medial femorotibial cartilage, and allocation to cluster M3 (the model was not run in DEC, given the previously observed similarity of results). These most important features in the RF were also the top features in the MLP, ENET, and XGBOOST models (Supplemental Figure S7).
Risk Groups
Patients were classified into low- and high-risk groups based on the predictive risk score using the RF model. In the training and test sets, a clear separation in Kaplan-Meier estimates for joint survival between the risk groups was observed ( Fig. 3 ). In the high-risk group, there were 37.4% and 18% KR events for training and test sets, respectively, whereas there were 0.4% and 5.9% KR events for the low-risk group in the respective training and test sets.
Figure 4 and Table 5 summarize the distribution of joint pathologies in the different risk groups. Severe effusion, meniscal pathology, cartilage pathology, and cluster allocation to cluster M3 were the most discriminative features (all P-values < 0.001).

Distribution of joint pathologies per risk group.
Distribution of Cluster Allocation and Joint Pathologies Per Risk Group.
The percentages for the clusters refer to the proportion of patients attributed to the low- versus high-risk group per phenotypic cluster.
Pearson’s chi-squared test.
Kruskal-Wallis rank sum test.
Discussion
In this study, we incorporated novel categorical ML-derived image analytics of joint pathologies from a large KOA data set into ML-based predictive algorithms, to determine their distribution in different pre-defined phenotypic clusters and assess their importance in addition to these clusters in the prediction of KR. Our results suggest different patterns of pathologies for the different phenotypes and an additive predictive value for certain MRI pathologies.
Unlike previous studies, we did not use the actual (raw) images as input for modeling but results from ML-based image analysis. This is also the first study to explicitly use an ML-based evaluation of ligaments and bone marrow lesions (BMLs). The use of results after image analysis as input may be associated with a loss of information compared to using all imaging data 34 but facilitates the interpretation of predictive algorithms by mimicking clinical reasoning approaches.
Other groups have also evaluated predictive models for KR. Some models use conventional x-ray as input variable which can improve the prediction compared to models based solely on clinical variables.33-36 The use of conventional x-ray as imaging input can, however, also introduce bias and limits generalizability due to the impact from positioning and reader-variability on the interpretation of images. 37 Most models including x-ray and MRI input show superiority for a combination of both imaging modalities or MRI over x-ray alone.38,39 Therefore, a number of groups have included MRI information in prediction models.38-42 The results are highly dependent on the exact input variables. While some authors38,40 describe a better predictive performance when including intra-articular tissue pathologies, Tolpadi et al. 35 reported improved prediction from periarticular tissues, arguing that indication for KR may not reflect true structural progression of OA. Apart from the importance of proper validation of any prognostic or predictive algorithm, it is clear that contributing variables need to be critically reviewed, and both conceptual frameworks and algorithms need to remain flexible to onboard novel insights or biomarkers based on ongoing research.
Publications based on raw images as input report that the highest impact on prediction is from intra-articular areas classically associated with OA progression, such as the cartilage thickness or the cartilage-bone interface.35,36 In this study, the impact of lateral or medial meniscal position (relative position of the outer meniscal border to the osseous rim of the tibial plateau), joint effusion, medial cartilage abnormality, or cluster allocation was higher than the impact from other features. For some pathologies, however, the frequency was very low in this data set, so their predictive value was likely underestimated. One such example was ACL pathology, whose predictive value for later KR was marginal in the present analysis, while multiple previous studies have demonstrated ACL injury to be a major risk factor for the development and progression of KOA.21,37 The inclusion of ligament pathologies in predictive algorithms may increase their utility in younger populations. Given the low prevalence, especially of ligament pathologies in the OAI data set, the evaluation of additional data sets is required to substantiate this assumption.
Similar to previous observations on bone shape,27,34 the results show an added value if clinical features and structural joint information are combined for predicting KR. The ranking of the individual joint pathologies in this context confirms our underlying assumption that not only cartilage degeneration is predictive of KR but also that KOA is a true whole-joint disease. It is notable that soft tissue pathologies (medial or lateral meniscal position and severe joint effusion) were the top 3 predictors for KR. This underlines the considerable limitations of x-ray for detecting changes relevant for the evaluation of KOA progression.
The previously described clusters largely differentiate a potentially post-traumatic cluster (D1/4, M1), a cluster of patients exhibiting a comorbidity-driven phenotype with potential underlying systemic and/or local inflammation (D2/3, M3), as well as a cluster with limited disease impact (D5, M2). In line with our interpretation of cluster D1/M1, the prevalence of ACL abnormalities was the highest in these patients. The association between the higher prevalence of ACL abnormalities in the high-risk group compared to the low-risk group also suggests the relevance of biomechanical aspects even if ACL abnormality in general did not feature as highly predictive. Ligamentous abnormalities may show a higher predictive value in younger patient cohorts or real-world evidence data. Similarly, in line with our previous analyses, the data support the notion of an inflammatory phenotype within the comorbidity-driven phenotype cluster, as suggested by the prevalence of severe effusion, which is highly prevalent in the high-risk group.
Meniscal extrusion is a well-described risk factor for progression of KOA38,39 and present in a relevant proportion of patients, especially in cluster M3. Surprisingly, the lateral meniscal position had a higher predictive impact on later KR than the medial one. In addition, the predictive value of lateral meniscal position for later KR refers to negative values (i.e., centralization of the meniscus relative to the osseous border), whereas medially, meniscal extrusion (with positive values) was observed as risk factor for KR. For the medial meniscus, extrusion has been described under load bearing, 40 and it seems likely that with meniscal pathology or structural changes, this extrusion becomes permanent, reflecting a loss of meniscal function and facilitating OA structural progression. 41 The lateral meniscus is anatomically more mobile and shows a slight natural centralization relative to the rim of the lateral tibial plateau. This observation of a trend for smaller, even negative values (reflecting a central position of the lateral meniscus relative to the osseous rim) to be associated with a risk for KR is unexpected. A trend toward a centralization of the lateral meniscus with increasing Kellgren-Lawrence grade has, however, been observed in a small cohort previously. 40 This may be related to an increase in bone shape rather than an actual change in lateral meniscal positioning. Employing saliency maps to identify areas that drive prediction, Rajamohan et al. 36 also observed the impact on prediction of total KR from peripheral bone cartilage interface, representing pathologic features like osteophytes rather than meniscal extrusion.
As shown in our analysis and previously reported, cartilage pathology is a major risk factor for KR. 42 Eckstein et al.43,44 described loss in cartilage thickness predominantly in the central and total medial tibiofemoral compartment but also an overall cartilage thinning as predictive of later KR. Similarly, Raynauld et al. 45 describe a predictive value of a ≥7% loss in cartilage volume at 1-year follow-up. Typically, cartilage data are derived from 3D segmentation of the cartilage, providing insights about the volume or thickness of cartilage. One challenge if using such volumetric data is the lack of normal values, e.g., for a specific height, sex, and age, which makes it difficult to appreciate the exact extent of degeneration at baseline. In this study, the input was limited to baseline categorization of cartilage as normal or abnormal which was still sufficient for KR prediction, perhaps because the lack of accurate quantification of cartilage thickness was offset by other features such as meniscal position or effusion, features that are often associated with cartilage damage.
In the above analyses, BMLs were not as prominent for prediction of KR as suggested by previous reports. 45-47 This discrepancy may be explained by the dichotomous assessments and the cross-sectional design of the present study. Severe effusion on baseline MRI was shown to be predictive of later KR in this population. Other groups have similarly described the predictive value of MRI-detected effusion over time. 48,49 In previous analyses, clinical effusion was one of the differentiating factors for cluster allocation, suggesting the potential presence of an inflammatory pheno- and/or endotype. 15 The potentially fluctuating nature of effusion may make this variable difficult to validate in a non-selected population.
This study was predominantly focused on improving risk prediction to support an enrichment strategy for trials that require OA progression to demonstrate a treatment benefit. In this study, the high-risk group (i.e., the upper quartile of risk score) had a 3-fold higher risk of KR within 2 years (test set). Enriching for this population could allow a reduction in sample size and/or trial duration. The use of individual predictions (e.g., relying on the population average or using digital twin approaches) could be leveraged to develop surrogate endpoints, thereby reducing trial durations, participant burden, and development costs for innovative treatments. An alternative use case could be prediction in a clinical setting to support shared decision-making, helping patients and providers to choose the most appropriate treatment options based on the projected time course of the disease, modifiable risk factors, and individual expectations. It is possible that not all input variables used in these algorithms are available in a given clinical setting or trial. One could, however, envision the development of a risk stratification calculator reporting different degrees of predictive accuracy, depending on the availability of input variables. For patients, having a reliable indication of the timeframe of disease progression may support informed decision-making around lifestyle changes, acceptance of treatment approaches, and private or professional life choices. For health care providers, knowledge of the approximate time course of the disease may support efficient allocation of health care resources, optimizing the time point of surgery prior to behavioral changes in activity patterns and pain-associated sarcopenia, which may both negatively affect surgical outcomes. Finally, being able to predict the time frame to KR may help optimize the allocation of health care resources and the planning of public health expenditure.
As mentioned above, being able to predict progression to KR would allow a better effectiveness evaluation of treatments for KOA based on their ability to delay KR.
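The enrichment idea discussed above (selecting the upper quartile of a predictive risk score and comparing its KR rate with that of the remaining patients) can be illustrated with a minimal sketch. It is not the analysis used in this study: the function names and the toy data are hypothetical, the cutoff uses a simple inclusive quartile definition, and the event rate is a crude proportion that ignores censoring and time to event.

```python
from statistics import quantiles


def upper_quartile_flags(scores):
    """Return the 75th-percentile cutoff and a high-risk flag for each score."""
    # quantiles(..., n=4) yields the three quartile cut points; index 2 is Q3.
    cutoff = quantiles(scores, n=4, method="inclusive")[2]
    return cutoff, [s > cutoff for s in scores]


def crude_rate(events):
    """Proportion of patients with the event (1 = KR, 0 = no KR)."""
    if not events:
        raise ValueError("empty group")
    return sum(events) / len(events)


def risk_ratio(scores, events):
    """Crude KR rate in the upper quartile divided by the rate in the rest."""
    _, high = upper_quartile_flags(scores)
    high_events = [e for e, h in zip(events, high) if h]
    low_events = [e for e, h in zip(events, high) if not h]
    low_rate = crude_rate(low_events)
    if low_rate == 0:
        raise ValueError("no events in the low-risk group; ratio undefined")
    return crude_rate(high_events) / low_rate


# Hypothetical toy data: eight patients with a risk score and a KR indicator.
toy_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.80, 0.90]
toy_events = [0, 0, 0, 1, 0, 0, 1, 1]

if __name__ == "__main__":
    cutoff, flags = upper_quartile_flags(toy_scores)
    print("cutoff:", cutoff, "high-risk patients:", sum(flags))
    print("crude risk ratio:", risk_ratio(toy_scores, toy_events))
```

In a real trial-enrichment setting, follow-up time and censoring would need to be handled, for example with Kaplan-Meier estimates at a fixed horizon, rather than with crude proportions.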
The study has some limitations. There are potential biases arising from the development of KEROS V2.0.0. Although the size of the underlying data set and its multi-institutional origin speak for the generalizability of the algorithms’ performance, there is an overrepresentation of certain manufacturers and a predominance of 1T machines. The performance of the algorithms needs to be validated across a larger data set with more representation of different vendor magnets and field strengths. The imaging findings were included as limited (binary) categories in the predictive models, and more detailed descriptions or measurements of pathologies may be useful to improve predictive value. Joint replacement is a complex endpoint, encompassing patient, surgeon, and health system variables. This endpoint was chosen based on the necessity to demonstrate a benefit on how patients “feel, function and their joints survive” 50 in order to claim a treatment for OA. Pain would have been a valid alternative but is difficult to fully appraise from registry data. Registries typically collect 1 to 2 assessments per year. In such a setting, data on pain are highly liable to chance and thereby reversibility, rendering joint replacement the most definite endpoint for prediction. We relied on baseline data in order to mimic a potential clinical trial setting, in which patient selection does not depend on longitudinal data. Longitudinal data would be expected to improve predictive performance, and further research could refine the current approach. Such a strategy could improve the basis for shared decision-making over time, including assessment of the impact and benefit of therapeutic interventions with regard to delaying KR. In addition, other types of ML algorithms and/or a stacking approach could be investigated to compare the predictive performance.
Although the OAI has been extensively explored and has shaped the perception of KOA, not all pathologies or patient characteristics may be represented as in the overall population, given the study setting, the long follow-up time, and the age restriction. Our findings require external validation in other large longitudinal data sets or real-world evidence data, especially in view of the low prevalence of certain pathologies such as ACL rupture 51 and the associated risk for KOA. 21,37 Another reason to insist on validation in an external or independent data set is the risk of overfitting. In this study, we observed a distinct difference between the model performance in the training set and that in the test set, suggesting that the performance of the model fine-tuned on the training set was overly optimistic. Validation is a prerequisite to verify the model performance and ensure its generalizability. Such research is, however, hindered by the limited availability of large data sets and the lack of consistent collection of input variables across data sets, registries, and real-world evidence data sources.
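The gap between training and test performance mentioned above can be quantified as the difference between the two areas under the curve. The following is a minimal sketch under simplifying assumptions: it uses a plain binary AUC computed by pairwise comparison rather than the time-dependent receiver operating characteristic curves used in this study, and the function names and toy labels and scores are hypothetical.

```python
def auc(labels, scores):
    """Binary AUC: probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def optimism(train_labels, train_scores, test_labels, test_scores):
    """Training AUC minus test AUC; a large positive value hints at overfitting."""
    return auc(train_labels, train_scores) - auc(test_labels, test_scores)


# Hypothetical toy data: perfectly separated training set, noisier test set.
train_y, train_s = [0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]
test_y, test_s = [0, 1, 0, 1], [0.2, 0.4, 0.6, 0.8]

if __name__ == "__main__":
    print("train AUC:", auc(train_y, train_s))
    print("test AUC:", auc(test_y, test_s))
    print("optimism:", optimism(train_y, train_s, test_y, test_s))
```

The pairwise computation is quadratic in the number of patients and is meant for illustration only; an external data set would take the place of the toy test set when assessing generalizability.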
The new insights gathered in this analysis are relevant for OA drug development but potentially also clinical and public health decision-making. The development of predictive models for knee replacement could facilitate the enrichment of trials for patients at risk, in order to evaluate joint survival in a treatment versus placebo group, which would present a regulatory and health-economically relevant endpoint. In such a scenario, the complementary use of various clinical and imaging variables, depending on their predictive value and availability, would be preferable. For clinical decision-making, predictive models could support shared decision-making and guide the choice of treatment escalation options.
Supplemental Material
Supplemental material, sj-docx-1-car-10.1177_19476035251395177 for Combining Machine-Learning Assessment of Multiple MRI Pathologies and Clinical Phenotypes for Predicting Joint Replacement in Knee Osteoarthritis: Data From the Osteoarthritis Initiative by G. D’Assignies, D. Demanse, F. Saxer, D. Laurent, P. Zille, T. Vesoul, P. Cordelle, G. Herpe, P.G. Conaghan and M. Schieker in CARTILAGE
Supplemental material, sj-docx-2-car-10.1177_19476035251395177 for Combining Machine-Learning Assessment of Multiple MRI Pathologies and Clinical Phenotypes for Predicting Joint Replacement in Knee Osteoarthritis: Data From the Osteoarthritis Initiative by G. D’Assignies, D. Demanse, F. Saxer, D. Laurent, P. Zille, T. Vesoul, P. Cordelle, G. Herpe, P.G. Conaghan and M. Schieker in CARTILAGE
Footnotes
Acknowledgements
The authors would like to thank the participants, investigators, and funders of the OAI database, a public-private partnership comprising 5 contracts (N01-AR-2–2258; N01-AR-2–2259; N01-AR-2–2260; N01-AR-2–2261; N01-AR-2–2262) funded by the NIH and conducted by the OAI Study Investigators. Data and/or research tools used in the preparation of this manuscript were obtained and analyzed from the controlled access data sets distributed from the Osteoarthritis Initiative (OAI), a data repository housed within the NIMH Data Archive (NDA). P.G.C. is funded in part by the National Institute for Health and Care Research (NIHR) Leeds Biomedical Research Centre (BRC) (NIHR203331). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR, or the Department of Health and Social Care.
Ethical Considerations
The OAI protocol required that informed consent be obtained prior to any study-associated activities. This analysis was approved by Ethikkommission Nordwest- und Zentralschweiz (Basec-No. 2023-01249).
Author Contributions
All authors have been involved in the conception and design of the study; Incepto has provided the image analysis; the analysis of the data was driven by D.D.; and all authors contributed to the interpretation of data. F.S. and D.D. primarily drafted the manuscript, which was critically reviewed and approved by all authors.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The analysis was funded (via protected time) by Novartis Biomedical Research (BASICHR0042) and Incepto (image analysis) under a research collaboration agreement with Incepto, who shared the results of MRI analysis in the OAI. The funder had no influence on the study design, data interpretation, or publication strategy.
Declaration of Conflicting Interests
The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Several authors are employees or shareholders of the sponsor organizations; the data analysis followed a pre-specified protocol to minimize potential bias. G.D. is the founder and a shareholder of Incepto Medical. D.D. is an employee and shareholder of Novartis. F.S. is an employee and shareholder of Novartis; she is affiliated with the University of Basel and a member of the European Union Medical Devices Expert Panel (Section of Orthopaedics, Traumatology, Rehabilitation, and Rheumatology). D.L. is an employee and shareholder of Novartis. P.Z. is an employee and shareholder of Incepto Medical. T.V. is an employee and shareholder of Incepto Medical. P.C. is an employee and shareholder of Incepto Medical. G.H. is an employee and shareholder of Incepto Medical. P.G.C. has done speakers bureaus or consultancies for AbbVie, AstraZeneca, Diffusion, Eli Lilly, Galapagos, Genascence, GlaxoSmithKline, Grunenthal, Janssen, Levicept, Novartis, Pacira, Regeneron, Sandoz, Stryker, and Takeda. M.S. is an employee and shareholder of Novartis; he is the owner of LivImplant GmbH and affiliated as a lecturer with ETH Zürich.
Data Availability Statement
The clinical data, PROs, and additional analyses are publicly available after registration from https://nda.nih.gov/oai/ (accessed October 18, 2022). The source code for DEC is available from
(accessed October 24, 2022) adapted to facilitate working with the Keras Package instead of the Caffe Package as reported by Xie et al., and for MFA, refer to Le et al. The underlying data are available from the OAI database.
Declaration of AI and AI-Assisted Technologies in the Writing Process
No AI or AI-assisted technologies were used in the writing process.
References
