Abstract
Knee osteoarthritis (KOA) remains the most prevalent form of osteoarthritis and a major cause of global disability. The Kellgren–Lawrence (KL) grading system, though widely used, suffers from inter- and intra-observer variability, especially in early disease stages. Artificial intelligence (AI) offers a transformative approach to automate KL grading on plain radiographs, providing consistent, reproducible, and scalable diagnostic solutions. This narrative review synthesizes recent advances in AI-based KL grading models, focusing on methodological frameworks, performance, clinical applicability, and limitations. Peer-reviewed studies applying AI-based methods for KL grading of KOA on radiographic images were reviewed. A literature search was conducted across PubMed, Embase, Web of Science, and Google Scholar to identify studies published between 2016 and 2025. Eligible studies satisfied predefined selection criteria and applied AI-based methods to radiographic grading of KOA. The review focused on model architectures, dataset characteristics, validation strategies, performance metrics, and comparisons with expert radiographic assessment. Eighteen eligible studies were included. Convolutional neural networks (CNNs) remain the core of automated KL grading, evolving from standard classification models to ensemble and ordinal regression frameworks. Model performance was evaluated against expert-assigned KL grades as the reference standard, with reported accuracies ranging from 75% to 98% and area under the curve values up to 0.98. Agreement with expert annotations, measured by Cohen’s kappa (κ), ranged from 0.67 to 0.86. Deep Siamese networks, Faster R-CNNs, and ensemble frameworks have enhanced localization of KOA radiographic features, thereby improving interpretability relative to human radiologic assessment. Ordinal regression and attention-based visualization (saliency and class activation mappings) reduced misclassification between adjacent KL grades.
Persistent challenges included subjective ground-truth labeling, dataset imbalance (particularly under-representation of early (KL 0–1) and severe (KL 4) disease), and limited external validation. Models trained primarily on Osteoarthritis Initiative and Multicenter Osteoarthritis Study datasets showed reduced generalizability on external hospital datasets. AI-driven KL grading demonstrates near-human accuracy and strong promise for clinical integration. However, addressing labeling subjectivity, dataset diversity, and explainability remains essential for trustworthy deployment. While KL grading is inherently radiograph-based, integration of clinical metadata and longitudinal radiographic data may support more robust disease characterization. Federated learning frameworks offer a pathway to improve generalizability while preserving data privacy.
Introduction
Knee osteoarthritis (KOA) is the most common manifestation of osteoarthritis.1,2 It is a chronic, progressive musculoskeletal condition that presents with stiffness, swelling, and pain in the affected knee joint. 3 It represents a major contributor to global disability and healthcare burden. Recent estimates suggest that in 2021 approximately 606.5 million people were living with osteoarthritis, 4 and KOA accounts for 60%–85% of all osteoarthritis (OA) cases,1,2 making the knee the most frequently affected joint. In 2020, KOA affected an estimated 4.9% of the global population (4711 cases per 100,000), with prevalence notably higher in women (6.0%) than in men (3.8%). 5 The global incidence rate was approximately 381 per 100,000 people annually, and KOA accounts for an estimated loss of 149 disability-adjusted life years per 100,000, underlining its burden on quality of life, mobility, and independence. 5 The burden of KOA continues to rise in parallel with demographic and lifestyle shifts. Between 1990 and 2020, the global number of KOA cases increased by over 130%, and future projections estimate a further 75% increase by 2050. 6 Importantly, KOA is no longer confined to older adults. Data from the Global Burden of Disease Study revealed that in 2019, more than half (52.3%) of new OA cases occurred in individuals under the age of 55, reflecting an alarming trend of early-onset disease. 7 This increase has been strongly associated with modifiable risk factors such as high body mass index (BMI), which accounted for 15.3% of early-onset OA cases in 2019, up from 9.4% in 1990. 7
KOA poses significant challenges for healthcare systems and patients alike. The disease leads to pain, stiffness, functional decline, and eventually long-term disability, often requiring joint replacement surgery. With aging populations and rising obesity rates, KOA is projected to remain a leading cause of disability worldwide, underscoring the need for improved diagnostic tools, early detection strategies, and interventions to slow disease progression. Conventionally, KOA is diagnosed using symptoms and imaging methods. Plain radiographs are the usual first-line imaging modality because of their low cost and wide availability.3,8,9 Radiographs are examined for joint space width (JSW) narrowing, osteophyte growth, subchondral sclerosis, cysts, and progressive bone contour changes to confirm the diagnosis.10,11 Accurate radiographic assessment remains central to the diagnosis and monitoring of KOA, with the Kellgren–Lawrence (KL) grading system and the Osteoarthritis Research Society International (OARSI) atlas being the most widely adopted standards. KL grading, a semi-quantitative scoring method dating back to 1957, is widely used as a clinical research tool in epidemiological studies. It assigns grades from 0 to 4 based on joint space narrowing (JSN), marginal osteophyte formation, subchondral sclerosis, subchondral cysts, and alterations in femoral condyle and tibial plateau contours.8,9,11–13 However, conventional KL grading is subjective and prone to inter-observer variability. Recent advances in artificial intelligence (AI) and machine learning (ML) have opened opportunities to automate KL scoring on plain radiographs, offering greater consistency, efficiency, and potential for early detection. Exploring these applications is critical to understanding how AI can support clinicians in improving KOA diagnosis and management. This study was intentionally designed as a narrative review rather than a systematic review or meta-analysis.
While a recent high-quality systematic review by Mohammadi et al. (2024) comprehensively quantified diagnostic accuracy metrics for AI-based KL grading, the rapidly evolving and methodologically heterogeneous nature of deep learning (DL) approaches in this field limits the interpretability of pooled estimates. 14 A narrative framework was therefore adopted to enable a broader, conceptually driven synthesis of model architectures, grading paradigms, dataset characteristics, labeling variability, validation strategies, and interpretability considerations that extend beyond quantitative performance metrics. This approach allows critical appraisal of methodological design choices and emerging research directions that are not readily captured through formal meta-analysis.
Traditional KL grading and its limitations
The KL grading system, introduced in 1957, was developed to address the inconsistency clinicians faced when interpreting radiographic features of osteoarthritis. Early investigations by KL on coal miners revealed substantial disagreement both between different observers and within the same observer at various times when evaluating radiographic changes. 15 To improve reproducibility, they proposed a five-grade classification supported by reference images, which has since become the most widely used tool for defining disease severity in KOA.8,13 In their original work, KL demonstrated that reliability of radiographic grading varied substantially across joints, with interobserver correlation coefficients ranging from as low as 0.10 in the wrist to as high as 0.83 in the knee, while intraobserver correlations showed a similar spread, from 0.42 in the dorsolumbar spine up to 0.88 in the metacarpophalangeal joint; notably, the knee exhibited one of the strongest agreements (r = 0.83 for both inter- and intraobserver reliability).8,12 The KL system has been extensively applied in both clinical and research settings, ranging from stratifying participants in epidemiological studies to guiding surgical decisions and monitoring disease progression, owing to its simplicity, global acceptance, and ability to summarize complex radiographic findings into standardized categories. 8 Table 1 presents the KL-grading system with description and radiographic features for KOA.
In most epidemiological and clinical studies, a KL grade ⩾2 is considered the reference standard threshold for the presence of radiographic knee osteoarthritis.
KL, Kellgren–Lawrence; OA, osteoarthritis.
Despite these strengths, a major limitation of KL grading lies in its susceptibility to inter- and intraobserver variability. Studies have shown that agreement between clinicians is often only moderate, particularly in borderline or early disease stages, while repeatability by the same observer may also vary significantly. Such variability can lead to inconsistent disease classification, potentially affecting clinical decision-making and complicating the design and interpretation of research studies. Table 2 presents the inter- and intra-observer variability of KL grading mentioned in various studies.
Summary of studies presenting inter- and intra-observer variability of KL grading.
ACL, Anterior Cruciate Ligament; AP, Anteroposterior; JSN, Joint Space Narrowing; KL, Kellgren–Lawrence; MARS, multicenter ACL revision study; OA, osteoarthritis.
Beyond observer variability, the ordinal nature of KL grading often oversimplifies the complex spectrum of structural changes in KOA, and subtle pathological features may go undetected. In addition, the individual KL grades have been described inconsistently across studies, especially Grade 2, the cut-off used to define knee OA. The original KL grade 2 definition is “definite osteophytes and possible narrowing of joint space.” However, subsequent studies have adopted divergent interpretations. For example, Jordan et al. 22 classified radiographs demonstrating definite osteophytes without any joint space narrowing as KL grade 2, thereby emphasizing osteophyte presence alone as sufficient for diagnosis. In contrast, Hart et al. 23 defined KL grade 2 as minimal but definite small osteophytes accompanied by minimal joint space narrowing, thereby requiring early structural narrowing in addition to osteophyte formation. Williams et al. 24 further broadened the definition of KL grade 2 as minimal osteophytes with possible joint space narrowing, additionally incorporating features such as cyst formation and subchondral sclerosis. These alternative descriptions, ranging from preserved joint space to inclusion of additional bony changes, reflect a lack of consensus regarding the minimal radiographic criteria required for KOA diagnosis at Grade 2. This inconsistency may create confusion in assigning grades and affect clinical diagnosis. Furthermore, the grading process is time-intensive and requires expert input, which limits scalability in large research cohorts and slows clinical trial recruitment.
Because of these limitations, there is a strong case for developing methods that reduce subjectivity, improve sensitivity (especially in early disease), and allow standardized, reproducible, and perhaps automated assessments. The results of the presented studies have spurred growing interest in AI-based solutions, which promise to overcome the subjectivity and inefficiencies inherent to traditional KL grading by offering greater accuracy, reproducibility, and scalability.
Methodology
Search strategy
A comprehensive literature search was conducted across PubMed, Embase, Web of Science, and Google Scholar to identify relevant studies on AI applications for KL grading of KOA using plain radiographs. The search covered all publications from January 2016 to September 2025. Search terms included combinations of: (“Kellgren-Lawrence” OR “KL grading”) AND (“artificial intelligence” OR “deep learning” OR “machine learning”) AND (“knee osteoarthritis” OR “KOA”) AND (“radiograph” OR “X-ray”).
Selection criteria
Studies were included if they:
Applied AI, ML, or DL models for KL grading on plain knee radiographs.
Used any one or more standard knee radiographic views (anteroposterior, posteroanterior, fixed-flexion, or equivalent).
Reported quantitative diagnostic or comparative performance metrics (e.g., accuracy, area under the curve (AUC), κ).
Provided sufficient methodological detail on dataset composition or preprocessing.
The following were excluded:
Conference abstracts, editorials, narrative reviews, or systematic reviews.
Animal-based or in vitro studies.
Imaging modalities other than plain radiographs (e.g., MRI, CT, ultrasound).
Studies focusing primarily on other scoring systems (e.g., OARSI atlas).
Study selection
The initial search identified 1386 records (PubMed: 276, Embase: 81, Web of Science: 440, Scopus: 246, Google Scholar: 1110). After removing duplicates, 1215 unique articles remained for title and abstract screening. Following initial screening, 1147 articles were excluded for not meeting inclusion criteria. A total of 68 full-text articles were retrieved for detailed evaluation. After applying exclusion criteria, 18 studies were deemed eligible for inclusion in the final narrative synthesis. Figure 1 illustrates the PRISMA flowchart for the article selection process.

PRISMA flowchart for article selection process.
Data extraction
For each included study, data were extracted on:
Study design and objective
Type of AI model
Dataset characteristics (source, size, population)
Diagnostic performance metrics (accuracy, sensitivity, specificity, AUC, κ, etc.)
External validation and expert comparison details.
Quality and bias consideration
Although this is a narrative review, methodological rigor was maintained through transparent reporting of dataset sources, AI model types, and performance validation strategies. Cross-validation and external testing were emphasized to assess generalizability and mitigate dataset bias.
Results
AI has emerged as a transformative tool in the radiographic evaluation of KOA, specifically in automating KL grading. By addressing the inherent subjectivity and inter-observer variability of conventional grading, AI-based models have introduced objective, reproducible, and efficient diagnostic pathways. The literature demonstrates steady progress in this domain, encompassing diverse architectures, datasets, and validation designs. The evolution of AI in KL grading can be understood through its technological progression, methodological refinements, data diversity, and diagnostic capability. Table 3 summarizes the main characteristics of AI studies included in this review paper.
Summary of included studies on AI models for automated KL grading of knee osteoarthritis.
CNN, convolutional neural networks; DL, deep learning; KL, Kellgren–Lawrence; KOA, knee osteoarthritis; MOST, Multicenter Osteoarthritis Study; OA, osteoarthritis; OAI, Osteoarthritis Initiative; OARSI, Osteoarthritis Research Society International; PACS, Picture Archiving and Communication System; PIM, plug-in modules; ROI, region-of-interest.
Model architectures and methodological advances in AI-based KL grading
Over the past decade, DL architectures, particularly convolutional neural networks (CNNs), have evolved from basic classification frameworks into sophisticated, multitask systems capable of modeling complex disease features and progression patterns. These models are utilized in multiple ways, one of which is automated, reproducible KL grading on plain radiographs to diagnose knee OA.
Early AI applications approached KL grading as a conventional multiclass classification problem, using CNNs as the core model. Tiulpin et al. 40 in 2018 pioneered the development of a transparent deep Siamese CNN-based computer-aided diagnosis method. Following this, Vaattovaara et al. 3 employed a deep Siamese neural network trained on large, multi-institutional datasets such as MOST and OAI, which enabled paired-knee analysis and external validation, key steps toward model generalizability. Cueva et al. 27 reported that a similar deep Siamese architecture, by assessing both knees simultaneously, improved consistency and mitigated dataset imbalance, demonstrating that DL models could match or even surpass human interobserver reliability. Building on this, Swiecicki et al. 16 utilized a Faster R-CNN architecture to detect, localize, and classify radiographic features of KOA, thereby introducing spatial attention into automated grading pipelines. These developments highlighted the feasibility of integrating feature localization with classification to improve diagnostic precision.
Subsequent advances have focused on addressing inherent challenges in KL grading, such as inter-grade ambiguity and class imbalance. ResNet-based classifiers and their derivatives have formed the backbone of most modern architectures, owing to their strong feature extraction capabilities and efficient gradient propagation.25,34 More sophisticated ensemble frameworks, combining CNN variants such as DenseNet, EfficientNet, and ResNeXt, have been introduced to enhance model robustness and interpretability.27,28 These ensemble networks leverage complementary feature hierarchies, improving generalization across variable imaging conditions and patient populations.
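The idea of combining complementary CNN variants can be illustrated with a minimal sketch of soft voting, in which each model's class probabilities are averaged before the final KL grade is chosen. Probability averaging is an assumption made here for illustration; the reviewed studies may fuse their models differently. The function names and the logit values are hypothetical and are not taken from any included study.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(per_model_logits):
    """Average class probabilities across models (soft voting).

    Returns the predicted KL grade and the averaged probabilities.
    """
    probs = [softmax(logits) for logits in per_model_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / n_models for c in range(n_classes)]
    grade = max(range(n_classes), key=lambda c: mean[c])
    return grade, mean

# Hypothetical logits for KL grades 0-4 from three models (illustrative only).
logits = [
    [0.1, 1.2, 2.0, 0.3, -1.0],   # model A favours KL 2
    [0.0, 2.1, 1.9, 0.2, -0.5],   # model B narrowly favours KL 1
    [-0.2, 1.0, 2.5, 0.8, -1.2],  # model C favours KL 2
]
grade, mean_probs = ensemble_predict(logits)
print(grade, [round(p, 3) for p in mean_probs])
```

In this toy case one model narrowly prefers KL 1, but the averaged probabilities still select KL 2, which shows how averaging can damp a single model's borderline error.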
A key methodological advancement has been the treatment of KL grading as an ordinal regression problem at the model training and analysis stage, rather than as a purely categorical classification task. Ordinal frameworks such as rank-consistent ordinal regression explicitly encode the ordered progression of KL grades within the loss function, penalizing predictions that violate grade hierarchy. In practical terms, this constrains misclassifications to adjacent grades (e.g., KL 1 vs KL 2), rather than distant grades (e.g., KL 0 vs KL 4), thereby reflecting gradual changes in radiographic severity such as incremental osteophyte formation and progressive joint space narrowing. By aligning prediction errors with clinically plausible transitions in radiographic features, ordinal modeling improves diagnostic consistency and more closely mirrors radiologists’ reasoning when interpreting borderline or evolving disease stages. 9 Complementary to this, attention-based visualization methods such as Grad-CAM, 28 eigen-CAM,30,33 attention maps, 3 and feature saliency mapping28,35 have provided interpretability, revealing that AI models consistently focus on clinically relevant features such as osteophyte formation and joint space narrowing.
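The ordinal formulation can be made concrete with a short sketch in the spirit of rank-consistent ordinal regression. It is an illustration under stated assumptions rather than the implementation used in any reviewed study: a KL grade is encoded as four binary targets ("is the grade greater than k?"), a single shared score is combined with one bias per threshold, and the loss is the sum of the binary cross-entropies. The score, the bias values, and all function names are hypothetical.

```python
import math

NUM_GRADES = 5  # KL 0-4

def encode_ordinal(grade, num_grades=NUM_GRADES):
    """Encode a grade as K-1 binary targets; target k answers 'grade > k?'."""
    return [1 if grade > k else 0 for k in range(num_grades - 1)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def threshold_probs(score, biases):
    """Shared score plus per-threshold bias gives P(grade > k) for each k."""
    return [sigmoid(score + b) for b in biases]

def ordinal_loss(probs, targets):
    """Sum of binary cross-entropies over the K-1 threshold tasks."""
    eps = 1e-12  # guards against log(0)
    return -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for p, t in zip(probs, targets)
    )

def decode_grade(probs):
    """Predicted grade = number of thresholds passed with probability > 0.5."""
    return sum(1 for p in probs if p > 0.5)

# Hypothetical values: biases are assumed non-increasing, so the threshold
# probabilities are non-increasing as well (rank consistency).
biases = [2.0, 0.5, -1.0, -2.5]
probs = threshold_probs(0.8, biases)
predicted = decode_grade(probs)

# The loss for the same prediction grows as the true grade moves further away.
for true_grade in (2, 3, 4):
    print(true_grade, round(ordinal_loss(probs, encode_ordinal(true_grade)), 3))
```

Because the same prediction is penalized more heavily the further the true grade lies from it, training under this loss discourages distant errors (for example KL 0 versus KL 4) more strongly than adjacent ones.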
Recent studies have also explored multitask learning frameworks that simultaneously analyze multiple osteoarthritic features, including region-specific osteophyte detection, joint space narrowing segmentation (lateral and medial), and overall KL grade, within a unified network enabling more precise differentiation between adjacent severity grades.37,38 The explicit quantification of radiographic features improved KL grading accuracy by anchoring grade predictions to anatomically and pathologically relevant criteria, thereby reducing ambiguity and misclassification between adjacent severity grades. These architectures not only enhance prediction accuracy but also provide feature-level insights that parallel human radiologic interpretation. Cueva et al. 27 extended this concept by integrating radiomic and clinical data through an ensemble fusion network, combining CNN-derived imaging features with patient-level metadata to improve diagnostic precision in borderline cases.
Technological refinements such as transfer learning from large-scale image repositories (e.g., ImageNet) and data augmentation techniques have further improved model robustness, particularly in addressing the challenge of underrepresented KL grades. Additionally, YOLOv3-based architectures have been explored for lesion localization and severity classification, providing real-time detection capabilities suitable for clinical deployment. 34 Innovative adaptations such as the plug-in ensemble architecture proposed by Lee et al. 33 utilize each pixel as an independent feature, enabling fine-grained analysis that captures subtle radiographic variations often overlooked by conventional CNNs.
The field has also witnessed a movement toward accessibility and real-world implementation. Lee et al. 35 introduced a no-code AI platform employing a ResNet101 backbone for automated KL grading, aimed at democratizing AI use among clinicians without programming expertise. Similarly, studies by Brejnebøl et al. 36 and Yoon et al. 38 validated the performance of commercial AI tools such as RBknee v2.1 and MEDI AI-OA in external clinical datasets, underscoring the translational maturity of AI-based KL classification systems.
Meta-analyses by Zhao et al. 41 have consolidated evidence from these diverse models, demonstrating pooled diagnostic accuracies exceeding 85% and highlighting the growing reproducibility and clinical readiness of AI in knee OA assessment. Yaylı et al. 29 compared two distinct modeling strategies: a single-model approach, in which CNNs were trained end-to-end to predict KL grades directly from knee radiographs, and a multi-model (feature-decomposed) approach, in which separate CNNs were first trained to detect individual pathological features such as joint space narrowing and osteophyte presence. Outputs from these feature-specific models were subsequently combined with the original radiographic images and basic demographic variables (age and sex) in an integrated prediction framework. Across seven CNN architectures, the single-model approach demonstrated superior accuracy and more stable calibration than the multi-model strategy.
Collectively, these methodological advances mark a paradigm shift in the radiographic assessment of KOA. AI-driven KL grading not only enhances diagnostic efficiency but also offers quantitative, reproducible metrics that could serve as surrogate biomarkers for disease progression and treatment response. As the field matures, the integration of explainable AI mechanisms, federated learning for multi-institutional data sharing, and real-world validation across diverse populations will be pivotal in bridging the gap between experimental performance and clinical adoption.
Datasets and study populations
The success and generalizability of AI models for automated KL grading of KOA depend heavily on the quality and diversity of datasets used for training and validation. Most studies have relied on two well-established, publicly available cohorts, the Osteoarthritis Initiative (OAI) and the Multicenter Osteoarthritis Study (MOST), which collectively underpin the majority of AI-based radiographic KOA research.3,8,16,26,28,32,37
The OAI dataset contains thousands of longitudinal bilateral knee radiographs from participants aged 45–79 years, with standardized acquisition protocols and expert-assigned KL grades across multiple time points. Similarly, the MOST dataset, comprising 10,052 radiographs from 3026 participants, focuses on the natural history of knee OA with imaging captured at several follow-ups. 8 These large, high-quality repositories have provided reproducible benchmarks for developing and validating DL models.
Many researchers have used subsets of these datasets tailored to specific aims. For example, Swiecicki et al. 16 utilized 2802 MOST radiographs in a Faster R-CNN architecture, Thomas et al. 26 analyzed over 32,000 OAI radiographs for CNN-based KL classification, and Yong et al. 9 implemented an ordinal regression approach using 4130 OAI images. Modified OAI datasets, cropped and separated by knee side to enhance region-of-interest (ROI) focus, have been used by Pi et al. 28 and Pongsakonpruttikul et al. 34 to reduce background noise and improve feature localization.
Recent efforts emphasize external and multicenter validation to ensure generalizability beyond research-grade data. Vaattovaara et al. 3 trained their deep Siamese neural network on MOST, validated it on OAI, and externally tested it on 208 clinical radiographs from Oulu University Hospital. Likewise, Lee et al. 33 validated an ensemble model trained on OAI using 17,040 MOST images, while Wang et al. 39 combined OAI with a Taiwanese dataset (FEMH, 246 knees) to assess cross-population transferability. Kondal et al. 32 provided a valuable Indian dataset (1043 knees), demonstrating model adaptability to non-Western imaging conditions.
Institutional archives and multicenter datasets further enhance real-world representativeness. Olsson et al. 25 incorporated unfiltered hospital radiographs, including images with implants and casts, reflecting authentic clinical variability. Yaylı et al. 29 contributed a multicenter dataset of 14,607 annotated radiographs from three hospitals, while Brejnebøl et al. 36 established a Picture Archiving and Communication System-based validation workflow for the commercial AI tool RBknee v2.1.
To address dataset imbalance and overfitting, common preprocessing techniques (ROI cropping, histogram equalization, normalization, and resizing) are routinely applied. Data augmentation (rotations, flips, translations, contrast adjustments) and transfer learning from pretrained CNNs (e.g., ImageNet) have improved robustness across imaging conditions.27,28,37
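As a rough illustration of these preprocessing and augmentation steps, the sketch below applies histogram equalization, intensity normalization, and a horizontal flip to a toy 8-bit grayscale image stored as a list of rows. It assumes nothing about the pipelines of the reviewed studies; the function names and the toy pixel values are hypothetical, and a real pipeline would normally rely on an image-processing library rather than plain lists.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalize an 8-bit grayscale image given as a list of rows."""
    flat = [px for row in image for px in row]
    hist = [0] * levels
    for px in flat:
        hist[px] += 1
    # Cumulative distribution of pixel intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:  # constant image: nothing to spread out
        return [row[:] for row in image]
    scale = (levels - 1) / (total - cdf_min)
    return [[round((cdf[px] - cdf_min) * scale) for px in row] for row in image]

def normalize(image, levels=256):
    """Rescale integer intensities to the range [0, 1]."""
    return [[px / (levels - 1) for px in row] for row in image]

def flip_horizontal(image):
    """Mirror the image left to right (a common augmentation)."""
    return [row[::-1] for row in image]

# Toy low-contrast "radiograph" patch (illustrative values only).
toy = [[50, 50, 60],
       [60, 70, 80]]

equalized = equalize_histogram(toy)
scaled = normalize(equalized)
augmented = flip_horizontal(scaled)
print(equalized)
```

Equalization stretches the narrow 50–80 intensity range of the toy patch across the full 0–255 range, which is the contrast-enhancing effect the technique is used for before the image is normalized and augmented.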
Overall, the methodological trajectory has evolved from using standardized OAI/MOST datasets toward multicenter, population-diverse, and clinically acquired data, marking a critical step toward the translational readiness of AI-assisted KL grading systems.
Diagnostic model performance
Across the reviewed studies, AI models have demonstrated rapidly improving diagnostic accuracy for grading KOA using the KL system on plain radiographs. The reported models employed a range of architectures from conventional CNNs to advanced ensemble and ordinal regression frameworks, achieving diagnostic performances that increasingly parallel expert musculoskeletal radiologists. The most frequently reported performance metrics include accuracy, sensitivity, specificity, precision, recall, F1-score, AUC, and Cohen’s kappa coefficient. Table 4 presents the detailed performance metrics of the included studies.
Diagnostic performance metrics of included studies.
Definitions—Accuracy: Proportion of correctly classified KL grades among all predictions. Sensitivity: Ability of the model to correctly identify osteoarthritis-positive cases. Specificity: Ability of the model to correctly identify osteoarthritis-negative cases. Precision: Proportion of predicted positive cases that are truly positive. Recall: Proportion of true positive cases correctly detected by the model. F1-score: Harmonic mean of precision and recall, reflecting their balance. AUC: Area under the receiver operating characteristic curve, indicating overall discriminative performance. Cohen’s κ: Measure of agreement between model predictions and reference standard beyond chance.
AUC, area under the curve; NPV, Negative Predictive Value; PPV, Positive Predictive Value.
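The metrics defined above can be computed from paired lists of reference and predicted KL grades, as in the following minimal sketch. It assumes that sensitivity and specificity are calculated after binarizing at KL ⩾ 2 (the conventional threshold for radiographic OA) and shows both unweighted and quadratic-weighted Cohen’s κ. The grade lists are invented for illustration and the function names are hypothetical; none of the values come from the included studies.

```python
def accuracy(y_true, y_pred):
    """Proportion of exactly matching grades."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_rates(y_true, y_pred, threshold=2):
    """Sensitivity and specificity after binarizing at KL >= threshold."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t >= threshold:
            if p >= threshold:
                tp += 1
            else:
                fn += 1
        else:
            if p >= threshold:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)

def cohen_kappa(y_true, y_pred, num_grades=5, weighted=False):
    """Cohen's kappa; weighted=True applies quadratic disagreement weights."""
    n = len(y_true)
    matrix = [[0] * num_grades for _ in range(num_grades)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    row = [sum(r) for r in matrix]
    col = [sum(matrix[i][j] for i in range(num_grades)) for j in range(num_grades)]
    observed = expected = 0.0
    for i in range(num_grades):
        for j in range(num_grades):
            if weighted:
                w = (i - j) ** 2 / (num_grades - 1) ** 2
            else:
                w = 0.0 if i == j else 1.0
            observed += w * matrix[i][j] / n          # observed disagreement
            expected += w * row[i] * col[j] / (n * n)  # chance disagreement
    return 1.0 - observed / expected

# Invented reference and predicted grades (all errors are between adjacent grades).
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [0, 1, 1, 2, 2, 2, 3, 4, 4, 4]

acc = accuracy(y_true, y_pred)
sens, spec = binary_rates(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
kappa_w = cohen_kappa(y_true, y_pred, weighted=True)
print(acc, sens, spec, round(kappa, 3), round(kappa_w, 3))
```

In this toy example the quadratic-weighted κ is higher than the unweighted κ because every error falls between adjacent grades, which is why weighted and unweighted κ values reported by different studies are not directly comparable.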
Early large-scale validation work by Vaattovaara et al. 3 using a deep Siamese neural network trained on the MOST and OAI datasets achieved a high diagnostic accuracy (AUC = 0.967; κ = 0.82) and substantial agreement with four expert readers (κ range = 0.74–0.82). This external validation study demonstrated strong reproducibility across multi-institutional cohorts despite class imbalance. Similarly, Swiecicki et al. 16 used a Faster R-CNN object-detection framework capable of analyzing both PA and lateral projections, reporting a mean weighted κ = 0.77 between the model and five radiologists, surpassing prior detection-based models that analyzed single projections independently.
Models based on ResNet architectures maintained strong discriminative capacity. Olsson et al. 25 reported a sensitivity of 97% and a specificity of 88% (AUC = 0.92) using a ResNet classifier trained on a Swedish clinical dataset, with challenges mainly observed in distinguishing KL grades 1 and 2, an ambiguity similarly recognized among human raters. Thomas et al. 26 demonstrated high agreement between CNN predictions and expert annotations (κ = 0.81–0.89), equaling or exceeding the highest reported inter-rater agreement (κ = 0.85) in literature, suggesting near-human-level performance.
Advancements in architectural design have improved both granularity and interpretability. Yong et al. 9 incorporated an Ordinal Regression Module into CNNs, achieving accuracy = 88.1% and AUC = 0.86, effectively minimizing misclassification between adjacent grades, particularly KL 2 and 3. Similarly, Pi et al. 28 explored ensemble CNNs (DenseNet-161, ResNet-101, EfficientNet-B5) on 8260 OAI images, demonstrating 76.9% accuracy and F1 = 0.77, with square input images yielding optimal spatial representation. The study also showed that non-square images (especially tall ones) can distort features and degrade performance, identifying input geometry as a factor that warrants further study. These findings support the view that performance variations are driven by population composition, image resolution, and preprocessing choices: compressed or non-standard images reduce accuracy, 27 and models trained on standardized OAI images sometimes underperform on routine clinical radiographs due to positioning differences.26,39
Integrative frameworks combining radiomic and clinical features have shown further promise. Cueva et al. 27 developed an ensemble model that achieved high class-wise accuracy (91% for KL 4, 89% for KL 3) with 73% agreement among radiologists, indicating effective model generalization. Similarly, Yaylı et al. 29 compared seven single-model CNNs against a multi-model pipeline using 14,607 annotated X-rays, where the single NfNet architecture achieved accuracy = 0.767 and F1 = 0.763, outperforming the multi-model approach.
Other architectures, including Residual Networks (ResNet variants) and YOLOv3 detection frameworks, have also performed competitively. Mohammed et al. 31 achieved dataset-specific accuracies up to 0.89 with their multi-step ResNet classifier, while Pongsakonpruttikul et al. 34 attained AUC = 0.8 and accuracy = 86.7% for three-class OA categorization, emphasizing the practicality of this approach in limited-resource settings. Object-detection coupled regression models, such as the two-stage CNN proposed by Kondal et al., 32 demonstrated notable precision gains (0.73) and a low mean absolute error (0.28) when tested on the external Indian dataset, particularly when KL grades were treated as ordinal rather than nominal variables.
Innovative ensemble strategies have further refined prediction stability. Lee et al. 33 implemented four plug-in modules (EfficientNet and Swin variants) on the OAI and MOST datasets and demonstrated high overall performance (AUC = 0.94; accuracy = 75.6%), performing particularly well in severe OA (Grade 4, sensitivity = 0.96). Tiulpin et al. 37 used a dual SE-ResNet/SE-ResNeXt ensemble, achieving AUC = 0.95–0.98 and κ = 0.82, with superior performance for moderate-to-severe OA detection. In addition to predicting overall KL grades, Tiulpin et al. 37 and Yoon et al. 38 explicitly quantified key radiographic features underpinning KL grading, including regional osteophyte formation (medial and lateral femur and tibia) and compartment-specific joint space narrowing. By decomposing KL scoring into its constituent radiographic components, these models reduced ambiguity between adjacent grades, particularly KL 1–3, and achieved improved discriminative performance, with AUCs ranging from 0.95 to 0.98.
Real-world deployment studies have underscored the feasibility of clinical use. Brejnebøl et al. 36 externally validated the commercial AI tool Radiobotics RBknee v2.1 on 99 Danish knees, obtaining accuracy = 97.8% and perfect intra-rater agreement (κ = 1.0), closely mirroring consultant radiologists’ performance. Similarly, Wang et al. 39 demonstrated high cross-population reproducibility using a deep CNN (AUC = 0.936; κ = 0.81–0.86) on OAI and Taiwanese datasets, effectively identifying surgical candidates (KL 3–4) even across differing imaging protocols.
Finally, accessibility-focused innovations have begun to emerge. Lee et al. 35 developed a no-code ResNet101-based platform (DEEP:PHI), allowing non-programmers to train and validate AI models. Though validation metrics (training AUC = 0.89; validation AUC = 0.80) were moderate, the approach highlights the growing democratization of healthcare AI.
Overall, the collective diagnostic evidence underscores that CNN-based and ensemble DL models can reliably perform automated KL grading with accuracy values ranging from 0.75 to 0.98, AUCs up to 0.98, and inter-rater agreements (κ) between 0.67 and 0.86, comparable to, and in some cases exceeding, human expert performance. Collectively, these results demonstrate that contemporary AI models not only perform comparably to expert readers but also generalize well to independent hospital datasets. This external reproducibility signifies growing readiness for clinical adoption, especially for triage, second-opinion reporting, and longitudinal disease tracking.
Discussion
This discussion synthesizes key findings from the reviewed studies, contextualizing advances in AI-based KL grading in relation to methodological design, dataset characteristics, and clinical applicability. Despite significant progress in automated KL grading of KOA, several recurring limitations temper the clinical readiness and generalizability of current AI models. Across studies, four major constraints stand out: observer-dependent variability in ground truth labels, dataset imbalance and sampling bias, restricted external validation limiting generalization, and ethical issues.
Ground-truth variability and labeling subjectivity
A fundamental challenge across nearly all reviewed studies lies in the subjectivity of KL grading itself. Even among expert readers, inter- and intra-observer variability remains substantial, with κ values between 0.65 and 0.85 commonly reported. This variability propagates into AI model training, as networks learn from inconsistent labels. Across included studies, definitions of KL grade 2 varied; therefore, this review reports results as presented by the original authors rather than enforcing a uniform redefinition. Instead, this variability is highlighted as a central limitation and a key target for future consensus-building efforts aimed at standardizing radiographic criteria for KL grading. Several studies25,27,28,33 explicitly acknowledged that ambiguity in the mid-grade categories (KL 1–2) contributed to higher misclassification rates, mainly because the definitions of these grades are vague or inconsistently applied. Similarly, Vaattovaara et al. 3 reported that their model’s inter-rater agreement (κ = 0.74–0.82) was comparable to the variation among human readers, underscoring that AI performance is bounded by the consistency of the radiologist-derived reference standard on which it is trained.
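For readers less familiar with the agreement statistic quoted throughout, the following is a minimal sketch of Cohen's kappa for two sets of KL grades (two readers, or a model and a reader). It uses the standard definition, one minus the ratio of observed to chance-expected disagreement. The optional quadratic weighting, which penalizes distant disagreements more than adjacent ones, is included as an illustration; whether a given reviewed study reported weighted or unweighted κ is not assumed here.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int],
                n_grades: int = 5, weighted: bool = False) -> float:
    """Cohen's kappa for grades 0..n_grades-1; quadratic weights if weighted=True."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    pair_counts = Counter(zip(rater_a, rater_b))
    marginal_a = Counter(rater_a)
    marginal_b = Counter(rater_b)

    def disagreement(i: int, j: int) -> float:
        # Weight of a disagreement between grade i and grade j.
        if weighted:
            return (i - j) ** 2 / (n_grades - 1) ** 2
        return 0.0 if i == j else 1.0

    observed = sum(disagreement(i, j) * c for (i, j), c in pair_counts.items()) / n
    expected = sum(
        disagreement(i, j) * marginal_a[i] * marginal_b[j]
        for i in range(n_grades) for j in range(n_grades)
    ) / n ** 2
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use a single grade")
    return 1.0 - observed / expected
```

With quadratic weighting, a reader pair that disagrees only between adjacent grades scores higher than a pair with the same number of disagreements spread across distant grades, which is why the choice of weighting should be stated whenever κ values are compared across studies.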
Dataset imbalance, diversity, and external validation
Dataset imbalance represents another key limitation. Many studies trained on the OAI and MOST datasets, where mild or moderate OA grades dominate, while advanced OA and normal knees are underrepresented. For instance, Mohammed et al. 31 observed severe class imbalance (3857 images in grade 0 vs only 295 in grade 4), leading to overfitting toward prevalent classes. Although data augmentation strategies, such as rotation, flipping, and brightness normalization, were employed in several studies9,26 to mitigate imbalance, they cannot fully replicate the morphological diversity of underrepresented categories. Moreover, the continued reliance on OAI and MOST restricts demographic diversity: both datasets predominantly include middle-aged to older adults from Western populations (predominantly the United States population) with standardized imaging protocols, with limited ethnic and anatomical variability. This raises concerns about geographic, ethnic, and clinical diversity underlying model training, potentially limiting generalizability to broader clinical settings where patient characteristics, radiographic acquisition parameters, and disease prevalence differ. Another critical issue is data standardization and image quality. Differences in acquisition angles, beam positioning, and knee alignment may distort JSW, altering AI predictions.26,39 While transfer learning and augmentation techniques were employed to counteract this, few studies performed true multicenter harmonization.
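One common mitigation that complements augmentation is to reweight the training loss by inverse class frequency. The sketch below is a generic illustration rather than the method of any reviewed study; the function name is hypothetical, and the usage example is restricted to the two class counts quoted above (grades 0 and 4), since the remaining counts are not given here.

```python
from typing import Dict


def inverse_frequency_weights(counts: Dict[int, int]) -> Dict[int, float]:
    """Per-class loss weights proportional to 1 / class frequency.

    Weights are scaled so that the count-weighted sum equals the total
    number of images, i.e. the average weight per image is 1.
    """
    if not counts or any(c <= 0 for c in counts.values()):
        raise ValueError("every class needs a positive image count")
    total = sum(counts.values())
    n_classes = len(counts)
    return {grade: total / (n_classes * c) for grade, c in counts.items()}


# Only the two counts quoted in the text; other grades are deliberately omitted.
weights = inverse_frequency_weights({0: 3857, 4: 295})
ratio = weights[4] / weights[0]  # how much more each grade 4 image counts
```

Here each grade 4 image would carry roughly 13 times the weight of a grade 0 image. Reweighting changes how much each example contributes to the loss but, like augmentation, it cannot add morphological variety that the minority class lacks.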
A related issue is the generalizability of trained models beyond controlled research environments. Only a few studies conducted rigorous external validations.3,32,36,39 These efforts revealed that models often perform well on in-domain data but degrade when exposed to images from new institutions or imaging systems. Brejnebøl et al. 36 demonstrated strong transferability of a commercial AI system (RBknee v2.1), yet cautioned that population homogeneity (predominantly European ethnicity) may limit extrapolation to global cohorts. Similarly, Wang et al. 39 noted that image quality, nonstandard AP positioning, and hardware variability across centers affected diagnostic consistency. Where external validation sets are present, they are often smaller and derived from single institutions, further compounding concerns about real-world robustness.
To address these gaps, future studies should prioritize the inclusion of multi-site, multi-hardware datasets encompassing varied imaging environments and underrepresented populations. Recent work in allied AI imaging domains has underscored that dataset heterogeneity enhances model resilience to domain shifts and mitigates performance degradation when deployed outside the original training context. 42 Furthermore, collaborative benchmarking efforts across international cohorts can support a more comprehensive assessment of diagnostic tools and ensure that performance claims are not confounded by population or scanner biases.
Methodological limitations and overfitting risks
Technical limitations also persist. Studies employing complex ensemble frameworks10,28,33 achieved superior metrics, but at the expense of computational efficiency and interpretability, which poses a critical barrier to real-world integration. The “black-box” nature of deep networks remains a concern, with few studies providing saliency maps or Grad-CAM explanations to validate model reasoning. In addition, hardware constraints, annotation variability, and the limited number of open-access clinical validation studies restrict scalability across healthcare environments.29,35,37 Olsson et al. 24 and Lee et al. 35 further emphasized that models trained solely on radiographs overlook relevant clinical correlates (e.g., pain scores, BMI, or function), which may limit their prognostic utility.
The differing findings reported by Yaylı et al. 29 and by Tiulpin et al. 37 and Yoon et al. 38 primarily reflect methodological differences in how feature decomposition was implemented rather than a contradiction in principle. Yaylı et al. showed that an end-to-end single-model CNN outperformed a multi-model framework in which separately trained feature detectors (osteophytes and joint space narrowing) and demographic variables were fused, likely due to suboptimal feature integration, redundancy, and error propagation from auxiliary models, leading to increased complexity and overfitting. In contrast, Tiulpin et al. and Yoon et al. embedded feature decomposition within tightly integrated, quantitatively defined frameworks, explicitly modeling region-specific osteophyte formation and compartment-specific joint space narrowing using normalized and clinically grounded measures. Notably, the use of relative joint space narrowing metrics and structured or ordinal learning reduced ambiguity between adjacent KL grades and improved discriminative performance. Collectively, these findings suggest that feature decomposition enhances KL grading only when radiographic features are rigorously quantified and integratively modeled; otherwise, simpler end-to-end approaches may yield more stable performance.
Overfitting remains a persistent methodological challenge in AI-based radiographic analysis, occurring when models learn dataset-specific patterns that fail to generalize beyond the training distribution. This risk is particularly pronounced in studies where training and evaluation data share similar imaging protocols or originate from the same institutional sources. Such conditions may lead to overly optimistic performance estimates that do not reflect real-world clinical deployment. None of the reviewed studies reported formal bias–variance decomposition or explicit quantitative measures of the overfitting–generalizability trade-off. Instead, overfitting was assessed indirectly through discrepancies between internal validation and external testing performance, calibration degradation, or reduced accuracy when models were applied to data from different institutions or imaging protocols. Studies that included external validation consistently demonstrated performance attenuation on out-of-domain datasets, suggesting variance-dominated behavior in models trained on homogeneous data sources. The absence of standardized reporting on bias–variance balance limits direct comparison of model robustness across studies and highlights an important methodological gap in the current literature.
To enhance robustness and generalizability, several methodological safeguards are warranted. First, the use of multi-institutional training and independent external testing cohorts can better capture variability in patient anatomy, disease presentation, and imaging hardware. Second, more rigorous cross-validation frameworks, including nested and repeated k-fold validation, should be adopted to reduce bias arising from chance data partitioning. 42 Incorporation of regularization strategies and uncertainty estimation methods, including Bayesian approaches, can further mitigate overconfident predictions, particularly when models encounter out-of-distribution samples. Additionally, domain adaptation techniques may help align feature representations across heterogeneous imaging environments, reducing sensitivity to site-specific artifacts. Finally, transparent reporting of patient-level and institution-level dataset separation is essential to prevent inadvertent data leakage that can artificially inflate performance metrics. Similar recommendations have been emphasized in broader ML literature, underscoring the importance of rigorous validation practices for reproducible and clinically reliable AI systems.
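The point about patient-level separation can be made concrete with a short sketch. It assumes each image carries a patient identifier (the identifiers and function name below are hypothetical) and assigns whole patients, not individual images, to the test set, so that two knees or repeat visits from one person can never straddle the split.

```python
import random
from typing import List, Sequence, Tuple


def patient_level_split(patient_ids: Sequence[str], test_fraction: float = 0.2,
                        seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) with no patient in both sets."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must lie strictly between 0 and 1")
    unique_patients = sorted(set(patient_ids))
    if len(unique_patients) < 2:
        raise ValueError("need at least two patients to form a split")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(unique_patients)
    n_test = max(1, round(len(unique_patients) * test_fraction))
    n_test = min(n_test, len(unique_patients) - 1)  # keep at least one for training
    test_patients = set(unique_patients[:n_test])
    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train_idx, test_idx


# Hypothetical image list: several patients contribute more than one radiograph.
image_patients = ["p1", "p1", "p2", "p2", "p3", "p4", "p4", "p5"]
train_idx, test_idx = patient_level_split(image_patients)
```

The same idea extends to institution-level separation by grouping on a site identifier instead of a patient identifier, and to k-fold schemes by rotating which group of patients is held out.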
Ethical, interpretability, and deployment limitations
Beyond algorithmic performance, the clinical translation of AI-based KL grading systems requires careful consideration of ethical, interpretability, and operational challenges. Interpretability is a critical determinant of clinical acceptance. Models that provide transparent outputs, such as attention maps or explicit quantification of radiographic features like osteophytes or joint space narrowing, are more likely to gain clinician trust, as these outputs can be directly correlated with familiar radiographic signs. Equally important are practical deployment considerations, including seamless integration with existing tools, intuitive user interfaces, and minimal disruption to established radiology workflows. Multi-disciplinary collaboration between clinical experts, AI scientists, and product managers is critical to implementing responsible AI. 43
Infrastructure and scalability further influence real-world adoption, particularly in resource-constrained settings where computational capacity and maintenance support may be limited. Moreover, transparency challenges associated with proprietary or commercial AI systems can hinder independent evaluation of safety and efficacy. Clear disclosure of training data characteristics, input and output handling methods, validation strategies, subgroup-specific performance metrics, and error analysis should therefore be considered essential. 44 Finally, continuous post-deployment auditing is necessary to monitor performance drift, emerging biases, and unintended consequences as clinical environments and patient populations evolve. 43
Collectively, these limitations highlight a persistent gap between algorithmic performance and clinical applicability. While CNN-based models now rival expert readers in KL grading accuracy, they continue to depend on imperfect training data and limited real-world validation. Addressing these challenges will require (1) standardized, multi-reader consensus labeling frameworks to minimize annotation bias, (2) balanced and demographically diverse datasets to improve equity and robustness, (3) transparent model reporting standards, including explainability metrics and cross-institutional validation, and (4) ethical, interpretable, and scalable solutions to ensure trustworthy deployment in clinical workflows.
Future directions
Future research should emphasize methodological rigor, diversity, and clinical relevance rather than incremental performance gains. Large-scale, multi-institutional datasets spanning diverse populations, imaging hardware, and acquisition protocols are essential to improve robustness and equity in AI-based KL grading. Given that the KL score is fundamentally a radiographic severity index, future AI efforts should preserve its image-based diagnostic nature and focus on improving grading consistency, interpretability, and reproducibility rather than incorporating non-imaging clinical variables into the grading process itself. Importantly, while clinical variables such as BMI, pain scores, functional impairment, and prior joint injury are critical for patient management and prognostication, their integration is more appropriately situated within downstream clinical or epidemiologic models that use AI-derived KL grades as standardized inputs, rather than as components of KL score computation. This separation avoids construct contamination and mitigates risks of collinearity when similar variables are subsequently adjusted for in disease progression or outcome prediction models. In this context, multimodal frameworks may support comprehensive osteoarthritis risk stratification pipelines, while preserving the conceptual and methodological integrity of KL grading as a diagnostic measure.32,34,37 Expanding datasets through federated learning, allowing decentralized model training across institutions, will enhance diversity and reduce bias while preserving data privacy. 35
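As a rough illustration of the aggregation step in federated learning, the sketch below averages model parameters from several sites, weighting each site by its number of training images (the widely used federated averaging scheme). It is a simplified sketch under stated assumptions: parameters are flattened into plain lists of floats, the site values are hypothetical, and local training, communication, and privacy mechanisms are omitted.

```python
from typing import List, Sequence


def federated_average(site_params: Sequence[Sequence[float]],
                      site_sizes: Sequence[int]) -> List[float]:
    """Size-weighted average of per-site parameter vectors.

    Only parameters leave each site; the radiographs themselves stay local.
    """
    if not site_params or len(site_params) != len(site_sizes):
        raise ValueError("need one size per site and at least one site")
    if any(n <= 0 for n in site_sizes):
        raise ValueError("site sizes must be positive")
    n_params = len(site_params[0])
    if any(len(p) != n_params for p in site_params):
        raise ValueError("all sites must share the same parameter layout")
    total = sum(site_sizes)
    return [
        sum(params[k] * n for params, n in zip(site_params, site_sizes)) / total
        for k in range(n_params)
    ]


# Two hypothetical hospitals; the second contributes three times as many images.
global_params = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In practice this step is repeated over many rounds, with each site retraining from the averaged parameters before the next aggregation.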
To reduce overfitting, future studies should adopt stronger validation frameworks, including independent external testing, clear reporting of dataset separation, and uncertainty estimation. Particular attention should be given to borderline KL grades, where labeling ambiguity and misclassification are most pronounced.
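One simple form of uncertainty estimation for borderline grades is the entropy of the predicted class probabilities, with high-entropy cases routed to a radiologist. The sketch below is a generic illustration; the function names, example probabilities, and threshold are hypothetical and would need to be calibrated on validation data.

```python
import math
from typing import List, Sequence


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one knee's predicted KL grade distribution."""
    total = sum(probs)
    if total <= 0 or any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative with a positive sum")
    normalized = [p / total for p in probs]
    return -sum(p * math.log(p) for p in normalized if p > 0)


def flag_for_review(batch: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Indices of knees whose prediction entropy exceeds the threshold."""
    return [i for i, probs in enumerate(batch) if predictive_entropy(probs) > threshold]


# Hypothetical outputs: a confident grade 0 and a knee split between grades 1 and 2.
batch = [
    [0.96, 0.01, 0.01, 0.01, 0.01],
    [0.02, 0.45, 0.45, 0.06, 0.02],
]
uncertain = flag_for_review(batch, threshold=0.8)
```

Entropy from a single forward pass reflects only how spread out the softmax output is; Bayesian or ensemble-based approaches would additionally capture disagreement between models.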
The adoption of explainable AI frameworks such as Grad-CAM and attention heatmaps can bolster clinician trust and regulatory acceptance by visualizing AI decision pathways. To reduce misclassification of the middle grades, which stems from vague definitions and from the use of differing criteria for the same grade, a consensus process is suggested to develop clear, modified definitions, especially for grades 1 and 2.13,22 To ensure sustained clinical relevance, models must undergo longitudinal validation to assess their capacity to predict OA progression and treatment outcomes.25,35,39 The emergence of no-code AI platforms (e.g., Lee et al. 35 ) can democratize model development for clinicians without programming expertise, facilitating translation into real-world healthcare workflows.
Finally, successful translation will require attention to ethical deployment and human–AI interaction. AI systems should be positioned as decision-support tools rather than replacements for expert judgment, with continuous post-deployment auditing to ensure consistent performance across patient subgroups. Addressing these considerations will be critical to realizing clinically trustworthy, scalable, and equitable AI-assisted KL grading systems.36,39,44
Ultimately, the future of AI-assisted KL grading lies in clinically explainable, ethically grounded, and technically robust systems validated across diverse populations and imaging standards. Such advances will enable precision stratification of osteoarthritis severity, streamline radiographic interpretation, and foster equitable access to AI-driven musculoskeletal care worldwide.
Conclusion
AI has redefined the landscape of radiographic KOA assessment by automating the KL grading system with remarkable accuracy and reproducibility. The reviewed evidence demonstrates that CNN-based and ensemble DL models can perform KL grading with diagnostic metrics equivalent to expert readers, often achieving accuracies above 85% and AUC values nearing 0.98. Methodological innovations, such as ordinal regression, multi-task learning, and ensemble architectures, have refined model interpretability and minimized misclassification across disease stages.
Yet, despite these advances, real-world implementation remains constrained by the subjectivity of KL labels, dataset imbalance, and limited cross-population validation. The overreliance on Western-centric datasets such as OAI and MOST restricts demographic representation and model fairness. Furthermore, the “black-box” nature of deep models continues to challenge interpretability and clinician trust.
Future progress demands a shift toward standardized, explainable, and ethically sound AI systems validated across diverse, multicenter cohorts. The incorporation of multicenter data, federated learning for privacy-preserving collaboration, and human–AI collaborative frameworks will bridge the gap between algorithmic precision and clinical reality. As accessibility increases through no-code platforms and open-source validation frameworks, AI-assisted KL grading stands poised to revolutionize musculoskeletal diagnostics, delivering precision, equity, and efficiency in KOA care worldwide.
