Artificial intelligence in Kellgren–Lawrence grading of knee osteoarthritis: bridging radiographic tradition with algorithmic precision

Abstract

Knee osteoarthritis (KOA) remains the most prevalent form of osteoarthritis and a major cause of global disability. The Kellgren–Lawrence (KL) grading system, though widely used, suffers from inter- and intra-observer variability, especially in early disease stages. Artificial intelligence (AI) offers a transformative approach to automate KL grading on plain radiographs, providing consistent, reproducible, and scalable diagnostic solutions. This narrative review synthesizes recent advances in AI-based KL grading models, focusing on methodological frameworks, performance, clinical applicability, and limitations. Peer-reviewed studies applying AI-based methods for KL grading of KOA on radiographic images were reviewed. A literature search was conducted across PubMed, Embase, Web of Science, and Google Scholar to identify studies published between 2016 and 2025. Eligible studies satisfied predefined selection criteria and applied AI-based methods to radiographic grading of KOA. The review focused on model architectures, dataset characteristics, validation strategies, performance metrics, and comparisons with expert radiographic assessment. Eighteen eligible studies were included. Convolutional neural networks (CNNs) remain the core of automated KL grading, evolving from standard classification models to ensemble and ordinal regression frameworks. Model performance was evaluated against expert-assigned KL grades as the reference standard, with reported accuracies ranging from 75% to 98% and area under the curve values up to 0.98. Agreement with expert annotations, measured by Cohen’s kappa (κ), ranged from 0.67 to 0.86. Deep Siamese networks, Faster R-CNNs, and ensemble frameworks have enhanced localization of KOA radiographic features, thereby improving interpretability relative to human radiologic assessment. Ordinal regression and attention-based visualization (saliency and class activation mappings) reduced misclassification between adjacent KL grades.
Persistent challenges included subjective ground-truth labeling, dataset imbalance, particularly under-representation of early (KL 0–1) and severe (KL 4) disease, and limited external validation. Models trained primarily on Osteoarthritis Initiative and Multicenter Osteoarthritis Study datasets showed reduced generalizability on external hospital datasets. AI-driven KL grading demonstrates near-human accuracy and strong promise for clinical integration. However, addressing labeling subjectivity, dataset diversity, and explainability remains essential for trustworthy deployment. While KL grading is inherently radiograph-based, integration of clinical metadata and longitudinal radiographic data may support more robust disease characterization. Federated learning frameworks offer a pathway to improve generalizability while preserving data privacy.

Plain language summary

How artificial intelligence can improve Kellgren–Lawrence grading of knee osteoarthritis on X-rays

What is this research about? Knee osteoarthritis (KOA) is a common joint disease, causing pain and reduced mobility. Its severity is typically assessed using the Kellgren–Lawrence (KL) grading system based on knee X-rays, but results can vary between observers, especially in early disease. This review examined how artificial intelligence (AI) can improve the consistency, speed, and objectivity of KL grading.

Why is this study important? KOA is a major global cause of disability. Early and accurate diagnosis is critical to slow disease progression and delay surgical intervention. Because traditional KL grading depends on expert interpretation and is prone to variability, AI-assisted grading offers the potential to standardize assessments and support clinical decision-making.

How is the research conducted? The review analyzed studies published between 2016 and 2025 that applied AI to grade KOA severity from knee X-rays. It evaluated AI model types, datasets, and performance compared with expert readers. Most studies used deep learning models capable of learning imaging patterns automatically.

What does the research find? Across 18 studies, AI models achieved accuracies between 75% and 98%, often matching expert performance. Convolutional neural networks (CNNs) were most effective, identifying features such as joint space narrowing and osteophytes. Ordinal regression and attention-based visualization improved grading of adjacent disease stages and interpretability. Key limitations included inconsistent labeling, limited dataset diversity, and scarce external validation.

What do these findings mean? AI-based KL grading can provide fast, reproducible, and objective assessments, supporting clinicians, especially where radiology expertise is limited. While not a replacement for clinicians, AI shows strong potential as a decision-support tool. Future work should prioritize diverse datasets, explainable models, and real-world validation to enable safe clinical adoption.

Keywords

artificial intelligence; convolutional neural networks; deep learning; diagnostic imaging; Kellgren–Lawrence grading; knee osteoarthritis; machine learning; observer variability; radiography

Introduction

Knee osteoarthritis (KOA) is the most common manifestation of osteoarthritis.^1,2 It is a chronic, progressive musculoskeletal condition presenting with stiffness, swelling, and pain in the affected knee joint.³ It represents a major contributor to global disability and healthcare burden. Recent estimates suggest that in 2021 approximately 606.5 million people were living with osteoarthritis,⁴ and KOA accounts for 60%–85% of all osteoarthritis (OA) cases,^1,2 making the knee the most frequently affected joint. In 2020, KOA affected an estimated 4.9% of the global population (4711 cases per 100,000), with prevalence notably higher in women (6.0%) than in men (3.8%).⁵ The global incidence rate was approximately 381 per 100,000 people annually, and KOA accounts for an estimated loss of 149 disability-adjusted life years per 100,000, underlining its burden on quality of life, mobility, and independence.⁵ The burden of KOA continues to rise in parallel with demographic and lifestyle shifts. Between 1990 and 2020, the global number of KOA cases increased by over 130%, and future projections estimate a further 75% increase by 2050.⁶ Importantly, KOA is no longer confined to older adults. Data from the Global Burden of Disease Study revealed that in 2019, more than half (52.3%) of new OA cases occurred in individuals under the age of 55, reflecting an alarming trend of early-onset disease.⁷ This increase has been strongly associated with modifiable risk factors such as high body mass index (BMI), which accounted for 15.3% of early-onset OA cases in 2019, up from 9.4% in 1990.⁷

KOA poses significant challenges for healthcare systems and patients alike. The disease leads to pain, stiffness, functional decline, and eventually long-term disability, often requiring joint replacement surgery. With aging populations and rising obesity rates, KOA is projected to remain a leading cause of disability worldwide, underscoring the need for improved diagnostic tools, early detection strategies, and interventions to slow disease progression. Conventionally, KOA is diagnosed on the basis of symptoms and imaging. Plain radiographs are the usual first-line imaging modality because of their low cost and wide availability.^3,8,9 Radiographs are examined for narrowing of the joint space width (JSW), termed joint space narrowing (JSN), as well as osteophyte growth, subchondral sclerosis, cysts, and progressive bone contour changes to confirm the diagnosis.^10,11 Accurate radiographic assessment remains central to the diagnosis and monitoring of KOA, with the Kellgren–Lawrence (KL) grading system and the Osteoarthritis Research Society International (OARSI) atlas being the most widely adopted standards. KL grading, a semi-quantitative scoring method, dates back to 1957 and is widely used as a clinical research tool in epidemiological studies, assigning grades from 0 to 4 based on JSN, marginal osteophyte formation, subchondral sclerosis, subchondral cysts, and alterations in femoral condyle and tibial plateau contours.^8,9,11–13 However, conventional KL grading is subjective and prone to inter-observer variability. Recent advances in artificial intelligence (AI) and machine learning (ML) have opened opportunities to automate KL scoring on plain radiographs, offering greater consistency, efficiency, and potential for early detection. Exploring these applications is critical to understanding how AI can support clinicians in improving KOA diagnosis and management. This study was intentionally designed as a narrative review rather than a systematic review or meta-analysis.
While a recent high-quality systematic review by Mohammadi et al. (2024) comprehensively quantified diagnostic accuracy metrics for AI-based KL grading, the rapidly evolving and methodologically heterogeneous nature of deep learning (DL) approaches in this field limits the interpretability of pooled estimates.¹⁴ A narrative framework was therefore adopted to enable a broader, conceptually driven synthesis of model architectures, grading paradigms, dataset characteristics, labeling variability, validation strategies, and interpretability considerations that extend beyond quantitative performance metrics. This approach allows critical appraisal of methodological design choices and emerging research directions that are not readily captured through formal meta-analysis.

Traditional KL grading and its limitations

The KL grading system, introduced in 1957, was developed to address the inconsistency clinicians faced when interpreting radiographic features of osteoarthritis. Early investigations by Kellgren and Lawrence on coal miners revealed substantial disagreement both between different observers and within the same observer at various times when evaluating radiographic changes.¹⁵ To improve reproducibility, they proposed a five-grade classification supported by reference images, which has since become the most widely used tool for defining disease severity in KOA.^8,13 In their original work, Kellgren and Lawrence demonstrated that the reliability of radiographic grading varied substantially across joints, with interobserver correlation coefficients ranging from as low as 0.10 in the wrist to as high as 0.83 in the knee, while intraobserver correlations showed a similar spread, from 0.42 in the dorsolumbar spine up to 0.88 in the metacarpophalangeal joint; notably, the knee exhibited one of the strongest agreements (r = 0.83 for both inter- and intraobserver reliability).^8,12 The KL system has been extensively applied in both clinical and research settings, ranging from stratifying participants in epidemiological studies to guiding surgical decisions and monitoring disease progression, owing to its simplicity, global acceptance, and ability to summarize complex radiographic findings into standardized categories.⁸ Table 1 presents the KL grading system with descriptions and radiographic features for KOA.

Table 1.

KL grading system for radiographic knee osteoarthritis.^8,11,12,16

KL grade	Definition	Key radiographic features
0 (normal)	No radiographic features of OA	Normal joint structure; no osteophytes, sclerosis, or joint space narrowing
1 (doubtful OA)	Doubtful joint space narrowing and possible osteophytic lipping	Very small or indistinct osteophytes; no definite joint space narrowing; early/minimal changes
2 (mild OA)	Definite osteophytes and possible joint space narrowing	Clear osteophyte formation on joint margins or tibial spines; joint space narrowing may be present; early subchondral bone changes (sclerosis)
3 (moderate OA)	Multiple osteophytes, definite joint space narrowing, some sclerosis, possible bone end deformity	Moderate reduction in joint space; multiple osteophytes; definite subchondral sclerosis; early deformity of bone ends; occasional subchondral pseudocysts
4 (severe OA)	Large osteophytes, marked joint space narrowing, severe sclerosis, definite bone end deformity	Extensive osteophyte formation; severe loss of joint space; pronounced subchondral sclerosis; bone contour deformity (e.g., femoral condyle or tibial plateau); pseudocyst formation common

In most epidemiological and clinical studies, a KL grade ⩾2 is considered the reference standard threshold for the presence of radiographic knee osteoarthritis.

KL, Kellgren–Lawrence; OA, osteoarthritis.

Despite these strengths, a major limitation of KL grading lies in its susceptibility to inter- and intraobserver variability. Studies have shown that agreement between clinicians is often only moderate, particularly in borderline or early disease stages, while repeatability by the same observer may also vary significantly. Such variability can lead to inconsistent disease classification, potentially affecting clinical decision-making and complicating the design and interpretation of research studies. Table 2 summarizes the inter- and intraobserver variability of KL grading reported in various studies.

Table 2.

Summary of studies presenting inter- and intra-observer variability of KL grading.

Author (year)	Study population/dataset	Observers	Radiographs/views used	Interobserver ICC	Intraobserver ICC	Comments/notes
Wright (2014)¹⁷	632 patients from MARS, 83 surgeons, 52 sites	3 independent, blinded observers (qualifications not specified)	Weight-bearing AP and/or Rosenberg	Rosenberg: 0.54 AP: 0.38	Not reported	Wide range in reliability due to technique, age, OA degree, and observer interpretation
Felson et al. (1987)¹⁸	1424 elderly patients (age 63–94, mean 73) from Framingham population	2 academically based bone and joint radiologists	Standing AP	0.85	Not reported	High interrater reliability; population lacked ethnic diversity
Damen et al. (2014)¹⁹	1002 participants aged 45–65 years from an early hip and knee OA cohort	4 research assistants trained by an experienced musculoskeletal radiologist and one experienced general practitioner	PA radiographs of the knee	KL ⩾1: 0.58	Not reported	Higher reliability for osteophytes than for JSN between readers
Scott et al. (1993)²⁰	30 standing AP knee radiographs from Baltimore Longitudinal Study of Aging (25 men, 5 women; age 42–84)	2 skeletal radiologists + 2 rheumatologists	Standing AP	0.68	0.87	Random selection; limited demographic info; equal radiographs per KL grade
Gossec et al. (2008)²¹	50 radiographs selected from 1759 radiographs across 5 databases	2 trained rheumatologists	Standing, extended knee	0.72	0.72	Selection criteria for 50 radiographs not explicitly stated; KL grades not confirmed equal

ACL, anterior cruciate ligament; AP, anteroposterior; ICC, intraclass correlation coefficient; JSN, joint space narrowing; KL, Kellgren–Lawrence; MARS, Multicenter ACL Revision Study; OA, osteoarthritis; PA, posteroanterior.

The ordinal nature of KL grading often oversimplifies the complex spectrum of structural changes in KOA, and subtle pathological features may go undetected. In addition, each KL grade has been described in several alternative ways, especially grade 2, the cut-off used to define knee OA. The original KL grade 2 definition is “definite osteophytes and possible narrowing of joint space.” However, subsequent studies have adopted divergent interpretations. For example, Jordan et al.²² classified radiographs demonstrating definite osteophytes without any joint space narrowing as KL grade 2, thereby emphasizing osteophyte presence alone as sufficient for diagnosis. In contrast, Hart et al.²³ defined KL grade 2 as minimal but definite small osteophytes accompanied by minimal joint space narrowing, thereby requiring early structural narrowing in addition to osteophyte formation. Williams et al.²⁴ further broadened the definition of KL grade 2 as minimal osteophytes with possible joint space narrowing, additionally incorporating features such as cyst formation and subchondral sclerosis. These alternative descriptions, ranging from preserved joint space to inclusion of additional bony changes, reflect a lack of consensus regarding the minimal radiographic criteria required for KOA diagnosis at grade 2. This may create confusion in assigning grades and affect clinical diagnosis. Furthermore, the grading process is time-intensive and requires expert input, which limits scalability in large research cohorts and slows clinical trial recruitment.

Because of these limitations, there is a strong case for developing methods that reduce subjectivity, improve sensitivity (especially in early disease), and allow standardized, reproducible, and perhaps automated assessments. The variability documented in the studies above has spurred growing interest in AI-based solutions, which promise to reduce the subjectivity and inefficiencies inherent to traditional KL grading by offering greater accuracy, reproducibility, and scalability.
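To make the notion of observer agreement concrete, the following minimal Python sketch computes a quadratically weighted Cohen's kappa between two readers' KL grades. It is an illustration only: the studies in Table 2 report intraclass correlation coefficients rather than kappa, the function name `quadratic_weighted_kappa` is hypothetical, and any grades passed to it in examples are invented rather than taken from the cited studies.

```python
def quadratic_weighted_kappa(reader_a, reader_b, n_grades=5):
    """Quadratically weighted Cohen's kappa for integer grades 0..n_grades-1."""
    if len(reader_a) != len(reader_b) or not reader_a:
        raise ValueError("both readers must grade the same, non-empty set of knees")
    n = len(reader_a)

    # Observed confusion matrix between the two readers.
    observed = [[0] * n_grades for _ in range(n_grades)]
    for a, b in zip(reader_a, reader_b):
        observed[a][b] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(n_grades)) for j in range(n_grades)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_grades):
        for j in range(n_grades):
            # Disagreement weight grows with the squared distance between grades.
            weight = (i - j) ** 2 / (n_grades - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two readers graded independently.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        # Both readers used one identical grade throughout; kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected
```

Complete agreement yields a value of 1. Because the weights are quadratic, confusing KL 0 with KL 4 is penalized sixteen times more heavily than confusing two adjacent grades, which suits an ordinal scale in which adjacent-grade disagreement is the typical failure mode.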

Methodology

Search strategy

A comprehensive literature search was conducted across PubMed, Embase, Web of Science, and Google Scholar to identify relevant studies on AI applications for KL grading of KOA using plain radiographs. The search covered all publications from January 2016 to September 2025. Search terms included combinations of: (“Kellgren-Lawrence” OR “KL grading”) AND (“artificial intelligence” OR “deep learning” OR “machine learning”) AND (“knee osteoarthritis” OR “KOA”) AND (“radiograph” OR “X-ray”).

Selection criteria

Studies were included if they:

Applied AI, ML, or DL models for KL grading on plain knee radiographs.

Used any one or more standard knee radiographic views (anteroposterior, posteroanterior, fixed-flexion, or equivalent).

Reported quantitative diagnostic or comparative performance metrics (e.g., accuracy, area under the curve (AUC), κ).

Provided sufficient methodological detail on dataset composition or preprocessing.

Studies were excluded if they:

Were conference abstracts, editorials, narrative reviews, or systematic reviews.

Were animal-based or in vitro studies.

Used imaging modalities other than plain radiographs (e.g., MRI, CT, ultrasound).

Focused primarily on other scoring systems (e.g., OARSI atlas).

Study selection

The initial search identified 1386 records (PubMed: 276, Embase: 81, Web of Science: 440, Scopus: 246, Google Scholar: 1110). After removing duplicates, 1215 unique articles remained for title and abstract screening. Following initial screening, 1147 articles were excluded for not meeting inclusion criteria. A total of 68 full-text articles were retrieved for detailed evaluation. After applying exclusion criteria, 18 studies were deemed eligible for inclusion in the final narrative synthesis. Figure 1 illustrates the PRISMA flowchart for the article selection process.

Figure 1.

PRISMA flowchart for article selection process.

Data extraction

For each included study, data were extracted on:

Study design and objective

Type of AI model

Dataset characteristics (source, size, population)

Diagnostic performance metrics (accuracy, sensitivity, specificity, AUC, κ, etc.)

External validation and expert comparison details.

Quality and bias consideration

Although this is a narrative review, methodological rigor was maintained through transparent reporting of dataset sources, AI model types, and performance validation strategies. Cross-validation and external testing were emphasized to assess generalizability and mitigate dataset bias.

Results

AI has emerged as a transformative tool in the radiographic evaluation of KOA, specifically in automating KL grading. By addressing the inherent subjectivity and inter-observer variability of conventional grading, AI-based models have introduced objective, reproducible, and efficient diagnostic pathways. The literature demonstrates steady progress in this domain, encompassing diverse architectures, datasets, and validation designs. The evolution of AI in KL grading can be understood through its technological progression, methodological refinements, data diversity, and diagnostic capability. Table 3 summarizes the main characteristics of AI studies included in this review paper.

Table 3.

Summary of included studies on AI models for automated KL grading of knee osteoarthritis.

First author (year) (reference number)	Type of study	Objective	Dataset (study population)	Type of AI model	Expert comparison	Limitations
Vaattovaara et al. (2025)³	Retrospective study	Evaluate performance of a deep learning model in KL grading against expert readers using an external dataset.	MOST (training), OAI (validation), Oulu University Hospital (testing); 106 subjects (208 radiographs, mean age 58).	Deep Siamese Neural Network	Compared with three radiologists and one orthopedic surgeon	Limited sample size for external validation; single imaging modality.
Yong et al. (2022)⁹	Retrospective comparative study	Improve KL grading by modeling as an ordinal regression task.	OAI dataset (4130 radiographs, 4796 participants).	CNN with ordinal regression module	Compared with baseline CNNs; three expert readers (two radiologists, one adjudicator)	Limited external validation and dataset variability.
Swiecicki et al. (2021)¹⁶	Retrospective diagnostic study	Develop a fully automated DL algorithm for KL grading matching radiologists’ performance.	MOST dataset; 2802 images (train/validation/test).	Faster R-CNN	Compared with five radiologists	Generalizability across imaging centers not evaluated.
Olsson et al. (2021)²⁵	Retrospective study	Classify OA severity in an unfiltered population including implants and non-degenerative findings.	Danderyd University Hospital dataset.	ResNet CNN	Compared with two orthopedic surgeons	Lack of external validation; small dataset diversity.
Thomas et al. (2020)²⁶	Retrospective study	Automate knee OA severity classification using CNN.	OAI dataset; 32,116 images (training/testing subsets).	CNN	Compared with two musculoskeletal radiologists	Limited interpretability; model may overfit to OAI data.
Cueva et al. (2022)²⁷	Experimental diagnostic study	Integrate radiographic and clinical data for OA detection/classification.	Public knee OA datasets; 225 subjects (balanced KL grades).	Ensemble CNN + Radiomics	Compared with three radiologists	Limited sample size; lack of prospective validation.
Pi et al. (2023)²⁸	Retrospective diagnostic study	Develop ensemble CNN model for accurate KL grading.	OAI dataset (8260 images, 4796 participants aged 45–79).	Ensemble CNN (DenseNet-161, ResNet-101, EfficientNet-B5)	Radiologist-validated dataset	Limited explainability; single-cohort data.
Yaylı et al. (2025)²⁹	Multicenter original research	Compare single- vs multi-model CNNs for KL classification.	14,607 annotated AP X-rays (three hospitals).	Single- and multi-model CNN pipelines	Inter-model comparison; no clinical reader head-to-head	Lack of human comparison; focus on computational benchmarking.
Tariq et al. (2023)³⁰	Original research (IEEE Access)	Develop CNN pipeline for KOA detection and KL classification.	Public datasets (OAI + local); thousands of X-rays.	CNN variants	Compared to prior models (no expert readers)	Sensitivity/specificity inconsistently reported; small local validation.
Mohammed et al. (2023)³¹	Original research (Diagnostics)	Build ResNet classifiers for KOA grading.	OAI and public datasets; several thousand images.	ResNet variants	Compared with prior models	Limited reporting of human expert comparison and dataset balance.
Kondal et al. (2020)³²	Methodological paper	Develop CNN/regression pipeline tuned to Indian radiographs.	OAI (training), private hospital (test).	Object detection + regression CNN	No expert comparison	Preprint; lacks peer-reviewed evaluation.
Lee et al. (2024)³³	Original research	Develop ensemble PIM for fine-grained KL grading.	OAI (train), MOST (test; 17,040 images).	Ensemble DL (EfficientNet + Swin variants)	Compared with prior models; expert-labeled OAI/MOST	Limited generalizability outside OAI/MOST datasets.
Pongsakonpruttikul et al. (2022)³⁴	Cross-sectional diagnostic study	Apply YOLOv3-tiny for detection and severity classification.	OAI dataset (1650 radiographs).	YOLOv3-tiny CNN	Radiologist/orthopedist labeled ROIs	Limited dataset and simplified architecture.
Lee et al. (2024)³⁵	Retrospective experimental study	Evaluate no-code AI platform (DEEP:PHI) for KL grading.	OAI (1526 patients: 717 train, 405 valid, 404 test).	ResNet-101 CNN	No direct expert comparison; OAI-graded dataset	Performance metrics not clearly reported.
Brejnebøl et al. (2022)³⁶	External validation study	Validate commercial RBknee AI against radiologists.	99 knees (50 patients) from Denmark PACS.	CNN (RBknee v2.1)	Compared with six radiology professionals	Small sample; commercial model transparency limited.
Tiulpin and Saarakkala (2020)³⁷	Retrospective validation	Develop multitask CNN for KL and OARSI grading.	OAI (train) + MOST (test); thousands of knees.	Ensemble CNN (SE-ResNet50, SE-ResNext50)	Compared with state-of-the-art and prior models	Complex ensemble requires high computational resources.
Yoon et al. (2023)³⁸	Retrospective validation	Develop MediAI-OA for feature extraction and KL grading.	OAI (44,193 train + 810 valid) + local test (400 knees).	HRNet, RetinaNet, NASNet	Compared with two orthopedic surgeons and one radiologist	Limited local dataset for final evaluation.
Wang et al. (2022)³⁹	Retrospective diagnostic study	Evaluate deep CNN model for real-world OA grading.	OAI (8964 knees) + FEMH (246 knees, Taiwan).	Deep CNN	Compared with two orthopedic surgeons and one radiologist	Lower performance on poor-quality radiographs.

CNN, convolutional neural networks; DL, deep learning; KL, Kellgren–Lawrence; KOA, knee osteoarthritis; MOST, Multicenter Osteoarthritis Study; OA, osteoarthritis; OAI, Osteoarthritis Initiative; OARSI, Osteoarthritis Research Society International; PACS, Picture Archiving and Communication System; PIM, plug-in modules; ROI, region-of-interest.

Model architectures and methodological advances in AI-based KL grading

Over the past decade, DL architectures, particularly convolutional neural networks (CNNs), have evolved from basic classification frameworks into sophisticated, multitask systems capable of modeling complex disease features and progression patterns. These models are utilized in multiple ways, one of which is automated, reproducible KL grading on plain radiographs to diagnose knee OA.

Early AI applications approached KL grading as a conventional multiclass classification problem, using CNNs as the core model. Tiulpin et al.⁴⁰ in 2018 pioneered the development of a transparent deep Siamese CNN-based computer-aided diagnosis method. Following this, Vaattovaara et al.³ employed a deep Siamese neural network trained on large, multi-institutional datasets such as MOST and OAI, which enabled paired-knee analysis and external validation, key steps toward model generalizability. Cueva et al.²⁷ reported that the ability of a similar deep Siamese architecture to assess both knees simultaneously improved consistency and mitigated dataset imbalance, demonstrating that DL models could match or even surpass human interobserver reliability. Building on this, Swiecicki et al.¹⁶ utilized a Faster R-CNN architecture to detect, localize, and classify radiographic features of KOA, thereby introducing spatial attention into automated grading pipelines. These developments highlighted the feasibility of integrating feature localization with classification to improve diagnostic precision.
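The defining property of a Siamese architecture is that two inputs pass through branches that share the same weights before their features are combined. The toy Python sketch below illustrates only that weight-sharing idea. It is not the network of any cited study: the class name `ToySiameseGrader` is hypothetical, the inputs are short stand-in feature vectors rather than radiograph patches, a single linear layer replaces the convolutional branches, and the weights are random and untrained.

```python
import random


def relu(vector):
    return [max(0.0, x) for x in vector]


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class ToySiameseGrader:
    """Two inputs, one shared feature extractor, one joint classification head."""

    def __init__(self, n_inputs, n_hidden, n_grades=5, seed=0):
        rng = random.Random(seed)
        # A single set of branch weights is used for BOTH inputs (the Siamese part).
        self.shared_w = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
        self.shared_b = [0.0] * n_hidden
        # The head sees the concatenated features of the two branches.
        self.head_w = [[rng.uniform(-1, 1) for _ in range(2 * n_hidden)] for _ in range(n_grades)]
        self.head_b = [0.0] * n_grades

    def embed(self, patch):
        return relu(linear(self.shared_w, self.shared_b, patch))

    def scores(self, patch_a, patch_b):
        joint = self.embed(patch_a) + self.embed(patch_b)  # list concatenation
        return linear(self.head_w, self.head_b, joint)  # one raw score per KL grade
```

Because both inputs are embedded with the same parameters, the model learns one feature extractor rather than two, which roughly halves the number of branch parameters and forces comparable inputs to be described in the same feature space.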

Subsequent advances have focused on addressing inherent challenges in KL grading, such as inter-grade ambiguity and class imbalance. ResNet-based classifiers and their derivatives have formed the backbone of most modern architectures, owing to their strong feature extraction capabilities and efficient gradient propagation.^25,34 More sophisticated ensemble frameworks, combining CNN variants such as DenseNet, EfficientNet, and ResNeXt, have been introduced to enhance model robustness and interpretability.^27,28 These ensemble networks leverage complementary feature hierarchies, improving generalization across variable imaging conditions and patient populations.
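One common way to combine several classifiers is soft voting, in which the per-grade probabilities of the individual models are averaged. The Python sketch below shows this rule under stated assumptions: the function names are hypothetical, the logits would come from separately trained networks, and the cited ensemble studies may use other fusion rules (for example, weighted averaging or a learned meta-classifier).

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits):
    """Average the per-grade probabilities of several models and pick the top grade.

    per_model_logits: one list of raw scores (one score per KL grade) per model.
    Returns (predicted_grade, mean_probabilities).
    """
    if not per_model_logits:
        raise ValueError("at least one model output is required")
    probabilities = [softmax(logits) for logits in per_model_logits]
    n_models = len(probabilities)
    n_grades = len(probabilities[0])
    mean = [sum(p[k] for p in probabilities) / n_models for k in range(n_grades)]
    grade = max(range(n_grades), key=mean.__getitem__)
    return grade, mean
```

Averaging probabilities rather than taking a majority vote of hard labels lets a confident model outweigh uncertain ones, and the averaged distribution can itself be inspected to see how close a borderline knee lies to the neighboring grade.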

A key methodological advancement has been the treatment of KL grading as an ordinal regression problem at the model training and analysis stage, rather than as a purely categorical classification task. Ordinal frameworks such as rank-consistent ordinal regression explicitly encode the ordered progression of KL grades within the loss function, penalizing predictions that violate grade hierarchy. In practical terms, this constrains misclassifications to adjacent grades (e.g., KL 1 vs KL 2), rather than distant grades (e.g., KL 0 vs KL 4), thereby reflecting gradual changes in radiographic severity such as incremental osteophyte formation and progressive joint space narrowing. By aligning prediction errors with clinically plausible transitions in radiographic features, ordinal modeling improves diagnostic consistency and more closely mirrors radiologists’ reasoning when interpreting borderline or evolving disease stages.⁹ Complementary to this, attention-based visualization methods such as Grad-CAM,²⁸ eigen-CAM,^30,33 attention maps,³ and feature saliency mapping^28,35 have provided interpretability, revealing that AI models consistently focus on clinically relevant features such as osteophyte formation and joint space narrowing.
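The ordinal idea can be sketched with the cumulative-threshold encoding used by rank-consistent ordinal regression approaches: a five-grade label becomes four binary questions of the form "is the grade greater than t?". The Python code below is a minimal illustration, not necessarily the formulation used by Yong et al.; the function names are hypothetical, and the threshold probabilities are assumed to come from a network with four sigmoid outputs whose values are non-increasing across thresholds.

```python
import math

N_GRADES = 5  # KL 0-4


def encode_ordinal(grade, n_grades=N_GRADES):
    """KL grade -> binary targets, one per threshold t: 1 if grade > t else 0."""
    return [1 if grade > t else 0 for t in range(n_grades - 1)]


def decode_ordinal(threshold_probs, cutoff=0.5):
    """Predicted grade = number of thresholds the model believes are exceeded."""
    return sum(1 for p in threshold_probs if p > cutoff)


def ordinal_loss(threshold_probs, grade, eps=1e-12):
    """Sum of binary cross-entropies over the thresholds."""
    targets = encode_ordinal(grade, len(threshold_probs) + 1)
    loss = 0.0
    for p, t in zip(threshold_probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        loss -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return loss
```

With this encoding, a prediction that is off by one grade gets exactly one threshold wrong, whereas predicting KL 4 for a KL 0 knee gets all four wrong. The loss therefore grows with the distance between predicted and true grade, which a plain categorical cross-entropy does not guarantee.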

Recent studies have also explored multitask learning frameworks that simultaneously analyze multiple osteoarthritic features, including region-specific osteophyte detection, joint space narrowing segmentation (lateral and medial), and overall KL grade, within a unified network enabling more precise differentiation between adjacent severity grades.^37,38 The explicit quantification of radiographic features improved KL grading accuracy by anchoring grade predictions to anatomically and pathologically relevant criteria, thereby reducing ambiguity and misclassification between adjacent severity grades. These architectures not only enhance prediction accuracy but also provide feature-level insights that parallel human radiologic interpretation. Cueva et al.²⁷ extended this concept by integrating radiomic and clinical data through an ensemble fusion network, combining CNN-derived imaging features with patient-level metadata to improve diagnostic precision in borderline cases.

Technological refinements such as transfer learning from large-scale image repositories (e.g., ImageNet) and data augmentation techniques have further improved model robustness, particularly in addressing the challenge of underrepresented KL grades. Additionally, YOLOv3-based architectures have been explored for lesion localization and severity classification, providing real-time detection capabilities suitable for clinical deployment.³⁴ Innovative adaptations such as the plug-in ensemble architecture proposed by Lee et al.³³ utilize each pixel as an independent feature, enabling fine-grained analysis that captures subtle radiographic variations often overlooked by conventional CNNs.
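As a simple illustration of how augmentation can help with under-represented grades, the Python sketch below oversamples minority grades by adding horizontally flipped copies until every grade is as frequent as the largest one. This is an assumed, deliberately minimal scheme rather than the procedure of any cited study: images are plain nested lists of pixel values, the function names are hypothetical, and practical pipelines typically add rotations, intensity changes, and similar transforms.

```python
import random
from collections import Counter


def hflip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def oversample_with_flips(samples, seed=0):
    """Balance grade counts by appending flipped copies of minority-grade images.

    samples: list of (image, grade) pairs.
    Returns a new list in which every grade occurs as often as the most common one.
    """
    rng = random.Random(seed)
    counts = Counter(grade for _, grade in samples)
    target = max(counts.values())
    balanced = list(samples)
    for grade, count in counts.items():
        pool = [image for image, g in samples if g == grade]
        for _ in range(target - count):
            # The KL grade of a knee does not change when the image is mirrored.
            balanced.append((hflip(rng.choice(pool)), grade))
    return balanced
```

Oversampling should be applied to the training split only; applying it before the train/test split would leak near-duplicate images into the evaluation set and inflate reported performance.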

The field has also witnessed a movement toward accessibility and real-world implementation. Lee et al.³⁵ introduced a no-code AI platform employing a ResNet-101 backbone for automated KL grading, aimed at democratizing AI use among clinicians without programming expertise. Similarly, studies by Brejnebøl et al.³⁶ and Yoon et al.³⁸ validated the performance of commercial AI tools such as RBknee v2.1 and MediAI-OA in external clinical datasets, underscoring the translational maturity of AI-based KL classification systems.

Meta-analyses by Zhao et al.⁴¹ have consolidated evidence from these diverse models, demonstrating pooled diagnostic accuracies exceeding 85% and highlighting the growing reproducibility and clinical readiness of AI in knee OA assessment. Yaylı et al.²⁹ compared two distinct modeling strategies: a single-model approach, in which CNNs were trained end-to-end to predict KL grades directly from knee radiographs, and a multi-model (feature-decomposed) approach, in which separate CNNs were first trained to detect individual pathological features such as joint space narrowing and osteophyte presence. Outputs from these feature-specific models were subsequently combined with the original radiographic images and basic demographic variables (age and sex) in an integrated prediction framework. Across seven CNN architectures, the single-model approach demonstrated superior accuracy and more stable calibration than the multi-model strategy.

Collectively, these methodological advances mark a paradigm shift in the radiographic assessment of KOA. AI-driven KL grading not only enhances diagnostic efficiency but also offers quantitative, reproducible metrics that could serve as surrogate biomarkers for disease progression and treatment response. As the field matures, the integration of explainable AI mechanisms, federated learning for multi-institutional data sharing, and real-world validation across diverse populations will be pivotal in bridging the gap between experimental performance and clinical adoption.

Datasets and study populations

The success and generalizability of AI models for automated KL grading of KOA depend heavily on the quality and diversity of datasets used for training and validation. Most studies have relied on two well-established, publicly available cohorts, the Osteoarthritis Initiative (OAI) and the Multicenter Osteoarthritis Study (MOST), which collectively underpin the majority of AI-based radiographic KOA research.^{3,8,16,26,28,32,37}

The OAI dataset contains thousands of longitudinal bilateral knee radiographs (ages 45–79 years) with standardized acquisition protocols and expert-assigned KL grades across multiple time points. Similarly, the MOST dataset, comprising 10,052 radiographs from 3026 participants, focuses on the natural history of knee OA with imaging captured at several follow-ups.⁸ These large, high-quality repositories have provided reproducible benchmarks for developing and validating DL models.

Many researchers have used subsets of these datasets tailored to specific aims. For example, Swiecicki et al.¹⁶ utilized 2802 MOST radiographs in a Faster R-CNN architecture, Thomas et al.²⁶ analyzed over 32,000 OAI radiographs for CNN-based KL classification, and Yong et al.⁹ implemented an ordinal regression approach using 4130 OAI images. Modified OAI datasets, cropped and separated by knee side to enhance region-of-interest (ROI) focus, have been used by Pi et al.²⁸ and Pongsakonpruttikul et al.³⁴ to reduce background noise and improve feature localization.

Recent efforts emphasize external and multicenter validation to ensure generalizability beyond research-grade data. Vaattovaara et al.³ trained their deep Siamese neural network on MOST, validated it on OAI, and externally tested it on 208 clinical radiographs from Oulu University Hospital. Likewise, Lee et al.³³ validated an ensemble model trained on OAI using 17,040 MOST images, while Wang et al.³⁹ combined OAI with a Taiwanese dataset (FEMH, 246 knees) to assess cross-population transferability. Kondal et al.³² provided a valuable Indian dataset (1043 knees), demonstrating model adaptability to non-Western imaging conditions.

Institutional archives and multicenter datasets further enhance real-world representativeness. Olsson et al.²⁵ incorporated unfiltered hospital radiographs, including images with implants and casts, reflecting authentic clinical variability. Yaylı et al.²⁹ contributed a multicenter dataset of 14,607 annotated radiographs from three hospitals, while Brejnebøl et al.³⁶ established a Picture Archiving and Communication System-based validation workflow for the commercial AI tool RBknee v2.1.

To address dataset imbalance and overfitting, common preprocessing techniques (ROI cropping, histogram equalization, normalization, and resizing) are routinely applied. Data augmentation (rotations, flips, translations, contrast adjustments) and transfer learning from pretrained CNNs (e.g., ImageNet) have improved robustness across imaging conditions.^27,28,37
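The augmentation operations listed above can be illustrated with a minimal sketch. It assumes a grayscale radiograph stored as a list of pixel rows with values in the range 0 to 255. The function names, the shift range, and the contrast range are hypothetical choices for illustration and are not taken from any reviewed study.

```python
import random


def horizontal_flip(image):
    """Mirror a 2D grayscale image (list of rows) left to right."""
    return [row[::-1] for row in image]


def translate(image, dx, fill=0):
    """Shift the image dx pixels right (left if negative), padding with `fill`.

    Assumes abs(dx) is smaller than the image width.
    """
    width = len(image[0])
    shifted_rows = []
    for row in image:
        if dx >= 0:
            shifted_rows.append([fill] * dx + row[:width - dx])
        else:
            shifted_rows.append(row[-dx:] + [fill] * (-dx))
    return shifted_rows


def adjust_contrast(image, factor):
    """Scale each pixel's deviation from the image mean, clipped to [0, 255]."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [[min(255.0, max(0.0, mean + factor * (p - mean))) for p in row]
            for row in image]


def augment(image, rng):
    """Apply a random flip, a small translation, and a mild contrast change."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    image = translate(image, rng.randint(-1, 1))
    return adjust_contrast(image, rng.uniform(0.8, 1.2))
```

Passing a seeded generator, for example `augment(image, random.Random(0))`, keeps the augmented training set reproducible between runs.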

Overall, the methodological trajectory has evolved from using standardized OAI/MOST datasets toward multicenter, population-diverse, and clinically acquired data, marking a critical step toward the translational readiness of AI-assisted KL grading systems.

Diagnostic model performance

Across the reviewed studies, AI models have demonstrated rapidly improving diagnostic accuracy for grading KOA using the KL system on plain radiographs. The reported models employed a range of architectures from conventional CNNs to advanced ensemble and ordinal regression frameworks, achieving diagnostic performances that increasingly parallel expert musculoskeletal radiologists. The most frequently reported performance metrics include accuracy, sensitivity, specificity, precision, recall, F1-score, AUC, and Cohen’s kappa coefficient. Table 4 presents the detailed performance metrics of the included studies.

Table 4.

Diagnostic performance metrics of included studies.

Author (year) (reference number)	Sensitivity	Specificity	PPV	NPV	AUC	Cohen’s κ	Accuracy	F1-score	Precision	Recall	Mean absolute error
Vaattovaara et al. (2025)³	0.83	0.974	0.96	0.87	0.96	0.82	–	–	–	–	–
Yong et al. (2022)⁹	–	–	–	–	0.8609	–	88.09%	–	–	–	0.33 (DenseNet-161)
Swiecicki et al. (2021)¹⁶	–	–	–	–	0.769	–	71.9%	–	–	–	–
Olsson et al. (2021)²⁵	0.97	0.88	–	–	0.92	–	–	–	–	–	–
Thomas et al. (2020)²⁶	–	–	–	–	0.81–0.89	–	0.872	0.866	0.884	0.849	–
Cueva et al. (2022)²⁷	–	–	–	–	–	–	61.7%	–	–	–	–
Pi et al. (2023)²⁸	–	–	–	–	–	–	76.33%	0.764	0.78	0.75	–
Yaylı et al. (2025)²⁹	–	–	–	–	0.676	–	0.767 (single), 0.740 (multi)	0.763 (single), 0.736 (multi)	0.76	0.767	–
Tariq et al. (2023)³⁰	–	–	–	–	0.98	0.99	0.98	0.97	0.98	0.96	0.02
Mohammed (2023)³¹	–	–	–	–	–	–	0.69–0.89	–	–	–	–
Kondal et al. (2020)³²	0.43–0.96 (grade-wise)	0.61–0.98	–	–	0.94	–	75.6%	0.75	–	–	–
Lee et al. (2024)³³	–	–	–	–	0.66	–	–	0.73	0.73	0.73	0.28
Pongsakonpruttikul et al. (2022)³⁴	55.1%	85.9%	–	–	0.8	–	86.7%	61.1%	68.7%	–	–
Lee et al. (2024)³⁵	0.56–0.61	0.87–0.93	0.58–0.60	0.87–0.93	–	–	0.80–0.89	0.56–0.60	–	–	–
Brejnebøl et al. (2022)³⁶	0.76	0.90	–	–	1.00	–	0.84	0.67	0.73	–	–
Tiulpin and Saarakkala (2020)³⁷	–	0.79–0.95	–	–	0.98	0.82	0.67	–	–	–	–
Yoon et al. (2023)³⁸	–	–	–	–	0.76–0.84	–	–	–	–	–	–
Wang et al. (2022)³⁹	0.92	0.945	0.923	–	0.936	0.80–0.86	78%	–	–	–	–

Definitions—Accuracy: Proportion of correctly classified KL grades among all predictions. Sensitivity: Ability of the model to correctly identify osteoarthritis-positive cases. Specificity: Ability of the model to correctly identify osteoarthritis-negative cases. Precision: Proportion of predicted positive cases that are truly positive. Recall: Proportion of true positive cases correctly detected by the model. F1-score: Harmonic mean of precision and recall, reflecting their balance. AUC: Area under the receiver operating characteristic curve, indicating overall discriminative performance. Cohen’s κ: Measure of agreement between model predictions and reference standard beyond chance.

AUC, area under the curve; NPV, Negative Predictive Value; PPV, Positive Predictive Value.
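As a worked illustration of the definitions above, the following sketch derives the binary metrics from confusion-matrix counts. It assumes that KL grades are first dichotomized, here with KL ≥ 2 treated as osteoarthritis-positive; that threshold is a common convention but an assumption in this example, and individual studies may define positivity differently. The function names are hypothetical, and every denominator is assumed to be non-zero.

```python
def to_binary(grades, threshold=2):
    """Map KL grades (0-4) to 1 (OA-positive) or 0 using an assumed threshold."""
    return [1 if g >= threshold else 0 for g in grades]


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def binary_metrics(tp, fp, tn, fn):
    """Compute the metrics defined in the table footnote from raw counts."""
    sensitivity = tp / (tp + fn)          # identical to recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)            # identical to PPV
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
    }
```

For instance, `binary_metrics(**confusion_counts(to_binary(reference), to_binary(predicted)))` would summarize a model against reference grades; note that sensitivity and recall, and likewise precision and PPV, are the same quantity under different names.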

Early large-scale validation work by Vaattovaara et al.³ using a deep Siamese neural network trained on the MOST and OAI datasets achieved a high diagnostic accuracy (AUC = 0.967; κ = 0.82) and substantial agreement with four expert readers (κ range = 0.74–0.82). This external validation study demonstrated strong reproducibility across multi-institutional cohorts despite class imbalance. Similarly, Swiecicki et al.¹⁶ used a Faster R-CNN object-detection framework capable of analyzing both PA and lateral projections, reporting a mean weighted κ = 0.77 between the model and five radiologists, surpassing prior detection-based models that analyzed single projections independently.
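Because several of the agreement figures above are weighted κ values, a short sketch of the computation may be useful. It assumes integer KL grades from 0 to 4 and supports unweighted, linear, and quadratic disagreement weights. The source does not state which weighting each study used, so the default chosen here is an assumption, and the function name is hypothetical.

```python
def weighted_kappa(rater_a, rater_b, n_classes=5, weighting="quadratic"):
    """Cohen's kappa between two raters with optional disagreement weights.

    weighting: "none" (classic kappa), "linear", or "quadratic".
    """
    n = len(rater_a)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    row_totals = [sum(observed[i]) for i in range(n_classes)]
    col_totals = [sum(observed[i][j] for i in range(n_classes))
                  for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            distance = abs(i - j) / (n_classes - 1)
            if weighting == "quadratic":
                weight = distance ** 2
            elif weighting == "linear":
                weight = distance
            else:
                weight = 0.0 if i == j else 1.0
            weighted_observed += weight * observed[i][j]
            # Expected count under chance agreement (independent marginals).
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use a single grade")
    return 1.0 - weighted_observed / weighted_expected
```

With quadratic weights, a two-grade disagreement is penalized four times as heavily as a one-grade disagreement, which suits an ordinal scale such as KL in which confusing adjacent grades is the least serious error.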

Models based on ResNet architectures maintained strong discriminative capacity. Olsson et al.²⁵ reported a sensitivity of 97% and a specificity of 88% (AUC = 0.92) using a ResNet classifier trained on a Swedish clinical dataset, with challenges mainly observed in distinguishing KL grades 1 and 2, an ambiguity similarly recognized among human raters. Thomas et al.²⁶ demonstrated high agreement between CNN predictions and expert annotations (κ = 0.81–0.89), equaling or exceeding the highest reported inter-rater agreement (κ = 0.85) in the literature, suggesting near-human-level performance.

Advancements in architectural design have improved both granularity and interpretability. Yong et al.⁹ incorporated an Ordinal Regression Module into CNNs, achieving accuracy = 88.1% and AUC = 0.86, effectively minimizing misclassification between adjacent grades, particularly KL 2 and 3. Similarly, Pi et al.²⁸ explored ensemble CNNs (DenseNet-161, ResNet-101, EfficientNet-B5) on 8260 OAI images, demonstrating 76.9% accuracy and F1 = 0.77, with square input images yielding optimal spatial representation. The study also showed that non-square images (especially tall ones) can distort features and degrade performance, identifying input aspect ratio as a factor that warrants further study. More broadly, performance variations across studies appear to be driven by population composition, image resolution, and preprocessing choices: compressed or non-standard images reduce accuracy,²⁷ and models trained on standardized OAI images sometimes underperform on routine clinical radiographs due to positioning differences.^26,39
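The idea of treating KL grades as ordered rather than nominal can be sketched with a generic cumulative (threshold) encoding. The internals of the Ordinal Regression Module of Yong et al. are not described here, so the code below is a stand-in illustration of ordinal target encoding in general and should not be read as their method. The function names and the 0.5 decision threshold are assumptions.

```python
def encode_ordinal(grade, n_classes=5):
    """Encode a KL grade as n_classes - 1 binary targets.

    Target k answers the question "is the grade greater than k?",
    so grade 2 becomes [1, 1, 0, 0].
    """
    return [1 if grade > k else 0 for k in range(n_classes - 1)]


def decode_ordinal(probabilities, threshold=0.5):
    """Convert per-threshold probabilities back to a grade.

    Counts how many consecutive thresholds are exceeded, starting from
    the lowest, and stops at the first one that is not.
    """
    grade = 0
    for p in probabilities:
        if p <= threshold:
            break
        grade += 1
    return grade


def mean_absolute_error(true_grades, predicted_grades):
    """Average distance in grade steps between reference and prediction."""
    total = sum(abs(t - p) for t, p in zip(true_grades, predicted_grades))
    return total / len(true_grades)
```

Under this encoding, a prediction that is off by one grade gets only one of the four binary targets wrong, whereas a nominal one-hot loss treats every wrong grade as equally wrong. Mean absolute error reports the same notion of distance at evaluation time.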

Integrative frameworks combining radiomic and clinical features have shown further promise. Cueva et al.²⁷ developed an ensemble model that achieved high class-wise accuracy (91% for KL 4, 89% for KL 3) with 73% agreement among radiologists, indicating effective model generalization. Similarly, Yaylı et al.²⁹ compared seven single-model CNNs against a multi-model pipeline using 14,607 annotated X-rays, where the single NfNet architecture achieved accuracy = 0.767 and F1 = 0.763, outperforming the multi-model approach.

Other architectures, including Residual Networks (ResNet variants) and YOLOv3 detection frameworks, have also performed competitively. Mohammed et al.³¹ achieved dataset-specific accuracies up to 0.89 with their multi-step ResNet classifier, while Pongsakonpruttikul et al.³⁴ attained AUC = 0.8 and accuracy = 86.7% for three-class OA categorization, emphasizing the practicality of the YOLOv3 approach in limited-resource settings. Object-detection coupled regression models, such as the two-stage CNN proposed by Kondal et al.,³² demonstrated notable precision gains (0.73) and a low mean absolute error (0.28) on the external Indian dataset, particularly when KL grades were treated as ordinal rather than nominal variables.

Innovative ensemble strategies have further refined prediction stability. Lee et al.³³ implemented four plug-in modules (EfficientNet and Swin variants) on the OAI and MOST datasets and demonstrated high overall performance (AUC = 0.94; accuracy = 75.6%), performing especially well in severe OA (grade 4, sensitivity = 0.96). In a comparable effort, Tiulpin et al.³⁷ used a dual SE-ResNet/SE-ResNeXt ensemble, achieving AUC = 0.95–0.98 and κ = 0.82, with superior performance for moderate-to-severe OA detection. In addition to predicting overall KL grades, Tiulpin et al.³⁷ and Yoon et al.³⁸ explicitly quantified key radiographic features underpinning KL grading, including regional osteophyte formation (medial and lateral femur and tibia) and compartment-specific joint space narrowing. By decomposing KL scoring into its constituent radiographic components, these models reduced ambiguity between adjacent grades, particularly KL 1–3, and achieved improved discriminative performance, with AUCs ranging from 0.95 to 0.98.
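The fusion rules used by the ensembles above are not detailed in this review. As a generic illustration of how an ensemble can stabilize predictions, the sketch below applies soft voting, that is, a weighted average of the per-grade probabilities produced by several models. It is an assumed, simplified scheme rather than the method of any cited study, and the function name is hypothetical.

```python
def soft_vote(model_probabilities, weights=None):
    """Combine per-grade probabilities from several models.

    model_probabilities: one probability list per model, each of length
    n_classes (KL 0-4 gives five entries).
    weights: optional per-model weights; defaults to equal weighting.
    Returns the predicted grade and the averaged probability vector.
    """
    n_models = len(model_probabilities)
    if weights is None:
        weights = [1.0] * n_models
    weight_sum = sum(weights)
    n_classes = len(model_probabilities[0])

    averaged = [
        sum(w * probs[c] for w, probs in zip(weights, model_probabilities)) / weight_sum
        for c in range(n_classes)
    ]
    predicted_grade = max(range(n_classes), key=lambda c: averaged[c])
    return predicted_grade, averaged
```

Averaging probabilities rather than taking a majority vote over hard labels lets a confident model outweigh two uncertain ones, which matters most for the adjacent-grade decisions where individual networks tend to disagree.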

Real-world deployment studies have underscored the feasibility of clinical use. Brejnebøl et al.³⁶ externally validated the commercial AI tool Radiobotics RBknee v2.1 on 99 Danish knees, obtaining accuracy = 97.8% and perfect intra-rater agreement (κ = 1.0), closely mirroring consultant radiologists’ performance. Similarly, Wang et al.³⁹ demonstrated high cross-population reproducibility using a deep CNN (AUC = 0.936; κ = 0.81–0.86) on OAI and Taiwanese datasets, effectively identifying surgical candidates (KL 3–4) even across differing imaging protocols.

Finally, accessibility-focused innovations have begun to emerge. Lee et al.³⁵ developed a no-code ResNet101-based platform (DEEP:PHI), allowing non-programmers to train and validate AI models. Though validation metrics (training AUC = 0.89; validation AUC = 0.80) were moderate, the approach highlights the growing democratization of healthcare AI.

Overall, the collective diagnostic evidence indicates that CNN-based and ensemble DL models can reliably perform automated KL grading, with accuracy values ranging from 0.75 to 0.98, AUCs up to 0.98, and agreement with expert readers (κ) between 0.67 and 0.86, comparable to, and in some cases exceeding, human expert performance. Several of these models also generalized to independent hospital datasets. This external reproducibility signifies growing readiness for clinical adoption, especially for triage, second-opinion reporting, and longitudinal disease tracking.

Discussion

This discussion synthesizes key findings from the reviewed studies, contextualizing advances in AI-based KL grading in relation to methodological design, dataset characteristics, and clinical applicability. Despite significant progress in automated KL grading of KOA, several recurring limitations temper the clinical readiness and generalizability of current AI models. Across studies, four major constraints stand out: observer-dependent variability in ground truth labels, dataset imbalance and sampling bias, restricted external validation limiting generalization, and ethical issues.

Ground-truth variability and labeling subjectivity

A fundamental challenge across nearly all reviewed studies lies in the subjectivity of KL grading itself. Even among expert readers, inter- and intra-observer variability remains substantial, with κ values between 0.65 and 0.85 commonly reported. This variability propagates into AI model training, as networks learn from inconsistent labels. Across included studies, definitions of KL grade 2 varied; this review therefore reports results as presented by the original authors rather than enforcing a uniform redefinition, and highlights this variability as a central limitation and a key target for future consensus-building efforts aimed at standardizing radiographic criteria for KL grading. Several studies^25,27,28,33 explicitly acknowledged that ambiguity in the mid-grade categories (KL 1–2) contributed to higher misclassification rates, mainly because these grades are vaguely or inconsistently defined. Similarly, Vaattovaara et al.³ reported that their model’s agreement with human readers (κ = 0.74–0.82) was comparable to the variation among the readers themselves, underscoring that AI performance is bounded by the inconsistency of its reference standard, even when that standard is derived from experienced radiologists.

Dataset imbalance, diversity, and external validation

Dataset imbalance represents another key limitation. Many studies trained on the OAI and MOST datasets, where mild or moderate OA grades dominate, while advanced OA and normal knees are underrepresented. For instance, Mohammed et al.³¹ observed severe class imbalance (3857 images in grade 0 vs only 295 in grade 4), leading to overfitting toward prevalent classes. Although data augmentation strategies, such as rotation, flipping, and brightness normalization, were employed in several studies^9,26 to mitigate imbalance, they cannot fully replicate the morphological diversity of underrepresented categories. Moreover, the continued reliance on OAI and MOST restricts demographic diversity: both datasets predominantly include middle-aged to older adults from Western populations (chiefly the United States), imaged with standardized protocols and showing limited ethnic and anatomical variability. This raises concerns about the geographic, ethnic, and clinical diversity underlying model training, potentially limiting generalizability to broader clinical settings where patient characteristics, radiographic acquisition parameters, and disease prevalence differ. Another critical issue is data standardization and image quality. Differences in acquisition angles, beam positioning, and knee alignment may distort the apparent joint space width (JSW), altering AI predictions.^26,39 While transfer learning and augmentation techniques were employed to counteract this, few studies performed true multicenter harmonization.
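One widely used complement to augmentation is to reweight the training loss by inverse class frequency, so that rare grades contribute as much as common ones. The sketch below is a generic illustration and is not attributed to any reviewed study. The function name is hypothetical, and the usage example includes only the two class counts quoted above, because counts for the remaining grades are not given here.

```python
def inverse_frequency_weights(class_counts):
    """Return per-class loss weights proportional to 1 / class frequency.

    Uses weight_c = N / (K * n_c), where N is the total number of images,
    K the number of classes, and n_c the count for class c. With this
    scaling every class contributes the same total weight, N / K.
    """
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        grade: total / (n_classes * count)
        for grade, count in class_counts.items()
    }


# Usage with the two counts reported in the text (other grades omitted).
reported_counts = {0: 3857, 4: 295}
weights = inverse_frequency_weights(reported_counts)
```

With these two counts, each grade 4 image receives roughly 13 times the weight of a grade 0 image (3857 / 295). Reweighting changes how much each example counts but, like augmentation, it adds no new morphological information about the rare grades.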

A related issue is the generalizability of trained models beyond controlled research environments. Only a few studies conducted rigorous external validations.^3,32,36,39 These efforts revealed that models often perform well on in-domain data but degrade when exposed to images from new institutions or imaging systems. Brejnebøl et al.³⁶ demonstrated strong transferability of a commercial AI system (RBknee v2.1), yet cautioned that population homogeneity (predominantly European ethnicity) may limit extrapolation to global cohorts. Similarly, Wang et al.³⁹ noted that image quality, nonstandard AP positioning, and hardware variability across centers affected diagnostic consistency. Where external validation sets exist, they are often small and derived from single institutions, further compounding concerns about real-world robustness.

To address these gaps, future studies should prioritize the inclusion of multi-site, multi-hardware datasets encompassing varied imaging environments and underrepresented populations. Recent work in allied AI imaging domains has underscored that dataset heterogeneity enhances model resilience to domain shifts and mitigates performance degradation when deployed outside the original training context.⁴² Furthermore, collaborative benchmarking efforts across international cohorts can support a more comprehensive assessment of diagnostic tools and ensure that performance claims are not confounded by population or scanner biases.

Methodological limitations and overfitting risks

Technical limitations also persist. Studies employing complex ensemble frameworks^10,28,33 achieved superior metrics, but at the expense of computational efficiency and interpretability, which poses a critical barrier to real-world integration. The “black-box” nature of deep networks remains a concern, with few studies providing saliency maps or Grad-CAM explanations to validate model reasoning. In addition, hardware constraints, annotation variability, and limited open-access clinical validation studies restrict scalability across healthcare environments.^29,35,37 Finally, Olsson et al.²⁵ and Lee et al.³⁵ emphasized that models trained solely on radiographs overlook relevant clinical correlates (e.g., pain scores, BMI, or function), which may limit their prognostic utility.

The differing findings reported by Yaylı et al.²⁹ and by Tiulpin et al.³⁷ and Yoon et al.³⁸ primarily reflect methodological differences in how feature decomposition was implemented rather than a contradiction in principle. Yaylı et al. showed that an end-to-end single-model CNN outperformed a multi-model framework in which separately trained feature detectors (osteophytes and joint space narrowing) and demographic variables were fused, likely due to suboptimal feature integration, redundancy, and error propagation from auxiliary models, leading to increased complexity and overfitting. In contrast, Tiulpin et al. and Yoon et al. embedded feature decomposition within tightly integrated, quantitatively defined frameworks, explicitly modeling region-specific osteophyte formation and compartment-specific joint space narrowing using normalized and clinically grounded measures. Notably, the use of relative joint space narrowing metrics and structured or ordinal learning reduced ambiguity between adjacent KL grades and improved discriminative performance. Collectively, these findings suggest that feature decomposition enhances KL grading only when radiographic features are rigorously quantified and integratively modeled; otherwise, simpler end-to-end approaches may yield more stable performance.

Overfitting remains a persistent methodological challenge in AI-based radiographic analysis, occurring when models learn dataset-specific patterns that fail to generalize beyond the training distribution. This risk is particularly pronounced in studies where training and evaluation data share similar imaging protocols or originate from the same institutional sources. Such conditions may lead to overly optimistic performance estimates that do not reflect real-world clinical deployment. None of the reviewed studies reported formal bias–variance decomposition or explicit quantitative measures of the overfitting–generalizability trade-off. Instead, overfitting was assessed indirectly through discrepancies between internal validation and external testing performance, calibration degradation, or reduced accuracy when models were applied to data from different institutions or imaging protocols. Studies that included external validation consistently demonstrated performance attenuation on out-of-domain datasets, suggesting variance-dominated behavior in models trained on homogeneous data sources. The absence of standardized reporting on bias–variance balance limits direct comparison of model robustness across studies and highlights an important methodological gap in the current literature.

To enhance robustness and generalizability, several methodological safeguards are warranted. First, the use of multi-institutional training and independent external testing cohorts can better capture variability in patient anatomy, disease presentation, and imaging hardware. Second, more rigorous cross-validation frameworks, including nested and repeated k-fold validation, should be adopted to reduce bias arising from chance data partitioning.⁴² Incorporation of regularization strategies and uncertainty estimation methods, including Bayesian approaches, can further mitigate overconfident predictions, particularly when models encounter out-of-distribution samples. Additionally, domain adaptation techniques may help align feature representations across heterogeneous imaging environments, reducing sensitivity to site-specific artifacts. Finally, transparent reporting of patient-level and institution-level dataset separation is essential to prevent inadvertent data leakage that can artificially inflate performance metrics. Similar recommendations have been emphasized in broader ML literature, underscoring the importance of rigorous validation practices for reproducible and clinically reliable AI systems.
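Patient-level separation, as recommended above, can be enforced mechanically when building cross-validation folds. The following minimal sketch assumes that each image carries a patient identifier and assigns all images from one patient to the same fold, so that bilateral knees or repeat visits never straddle the training and test sets. The function names and the greedy balancing rule are illustrative assumptions.

```python
from collections import defaultdict


def patient_level_folds(patient_ids, n_folds=5):
    """Assign image indices to folds without splitting any patient.

    patient_ids: one identifier per image, in dataset order.
    Returns a list of n_folds lists of image indices.
    """
    images_by_patient = defaultdict(list)
    for index, patient in enumerate(patient_ids):
        images_by_patient[patient].append(index)

    folds = [[] for _ in range(n_folds)]
    # Place patients with the most images first, each into the currently
    # smallest fold, to keep fold sizes roughly balanced.
    ordered = sorted(images_by_patient,
                     key=lambda p: (-len(images_by_patient[p]), str(p)))
    for patient in ordered:
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(images_by_patient[patient])
    return folds


def train_test_indices(folds, test_fold):
    """Return (train, test) index lists with one fold held out."""
    test = list(folds[test_fold])
    train = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
    return train, test
```

The same grouping idea extends to institution-level separation by passing hospital identifiers instead of patient identifiers, which approximates external testing when only pooled multicenter data are available.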

Ethical, interpretability, and deployment limitations

Beyond algorithmic performance, the clinical translation of AI-based KL grading systems requires careful consideration of ethical, interpretability, and operational challenges. Interpretability is a critical determinant of clinical acceptance. Models that provide transparent outputs such as attention maps or explicit quantification of radiographic features like osteophytes or joint space narrowing are more likely to gain clinician trust, as these outputs can be directly correlated with familiar radiographic signs. Equally important are practical deployment considerations, including seamless integration with existing tools, intuitive user interfaces, and minimal disruption to established radiology workflows. Multidisciplinary collaboration between clinical experts, AI scientists, and product managers is critical to implementing responsible AI.⁴³

Infrastructure and scalability further influence real-world adoption, particularly in resource-constrained settings where computational capacity and maintenance support may be limited. Moreover, transparency challenges associated with proprietary or commercial AI systems can hinder independent evaluation of safety and efficacy. Clear disclosure of training data characteristics, input and output handling methods, validation strategies, subgroup-specific performance metrics, and error analysis should therefore be considered essential.⁴⁴ Finally, continuous post-deployment auditing is necessary to monitor performance drift, emerging biases, and unintended consequences as clinical environments and patient populations evolve.⁴³

Collectively, these limitations highlight a persistent gap between algorithmic performance and clinical applicability. While CNN-based models now rival expert readers in KL grading accuracy, they continue to depend on imperfect training data and limited real-world validation. Addressing these challenges will require (1) standardized, multi-reader consensus labeling frameworks to minimize annotation bias, (2) balanced and demographically diverse datasets to improve equity and robustness, (3) transparent model reporting standards, including explainability metrics and cross-institutional validation, and (4) ethical, interpretable, and scalable solutions to ensure trustworthy deployment in clinical workflows.

Future directions

Future research should emphasize methodological rigor, diversity, and clinical relevance rather than incremental performance gains. Large-scale, multi-institutional datasets spanning diverse populations, imaging hardware, and acquisition protocols are essential to improve robustness and equity in AI-based KL grading. Given that the KL score is fundamentally a radiographic severity index, future AI efforts should preserve its image-based diagnostic nature and focus on improving grading consistency, interpretability, and reproducibility rather than incorporating non-imaging clinical variables into the grading process itself. Importantly, while clinical variables such as BMI, pain scores, functional impairment, and prior joint injury are critical for patient management and prognostication, their integration is more appropriately situated within downstream clinical or epidemiologic models that use AI-derived KL grades as standardized inputs, rather than as components of KL score computation. This separation avoids construct contamination and mitigates risks of collinearity when similar variables are subsequently adjusted for in disease progression or outcome prediction models. In this context, multimodal frameworks may support comprehensive osteoarthritis risk stratification pipelines, while preserving the conceptual and methodological integrity of KL grading as a diagnostic measure.^32,34,37 Expanding datasets through federated learning, allowing decentralized model training across institutions, will enhance diversity and reduce bias while preserving data privacy.³⁵
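The aggregation step at the heart of federated learning can be sketched briefly. The example below assumes the standard federated averaging scheme, in which each institution trains locally and shares only model parameters, which a coordinator then averages in proportion to local sample size. It is a conceptual illustration with hypothetical names and flat parameter lists, and no reviewed study is described as having used it.

```python
def federated_average(site_parameters, site_sample_counts):
    """Average model parameters across sites, weighted by sample count.

    site_parameters: one flat list of floats per institution, all of the
    same length (the locally trained model weights).
    site_sample_counts: number of training radiographs at each institution.
    Only these parameters and counts leave a site; images do not.
    """
    total_samples = sum(site_sample_counts)
    n_parameters = len(site_parameters[0])
    return [
        sum(params[i] * count
            for params, count in zip(site_parameters, site_sample_counts))
        / total_samples
        for i in range(n_parameters)
    ]
```

In practice this step would be repeated over many communication rounds, with the averaged parameters sent back to every site as the starting point for the next round of local training.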

To reduce overfitting, future studies should adopt stronger validation frameworks, including independent external testing, clear reporting of dataset separation, and uncertainty estimation. Particular attention should be given to borderline KL grades, where labeling ambiguity and misclassification are most pronounced.
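A simple form of uncertainty estimation is to measure how spread out the predicted grade probabilities are and to route ambiguous cases to a human reader. The sketch below uses normalized predictive entropy for this purpose. The 0.5 cut-off and the function names are hypothetical, and a cut-off used in practice would need to be calibrated on validation data.

```python
import math


def predictive_entropy(probabilities):
    """Shannon entropy (in nats) of a predicted grade distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def needs_review(probabilities, max_normalized_entropy=0.5):
    """Flag a prediction whose entropy exceeds a fraction of the maximum.

    The maximum entropy, log(n_classes), is reached when every grade is
    equally likely, so the normalized value lies between 0 and 1.
    """
    normalized = predictive_entropy(probabilities) / math.log(len(probabilities))
    return normalized > max_normalized_entropy
```

A prediction split almost evenly between KL 1 and KL 2 would be flagged under this rule, whereas a confident grade 0 prediction would not, which targets reader attention on the borderline grades where labeling ambiguity is greatest.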

The adoption of explainable AI frameworks such as Grad-CAM and attention heatmaps can bolster clinician trust and regulatory acceptance by visualizing AI decision pathways. To avoid misclassification of the middle grades, which stems from vague definitions and from the use of differing definitions for the same grade, a consensus process is suggested to develop clear, modified definitions, especially for grades 1 and 2.^13,22 To ensure sustained clinical relevance, models must undergo longitudinal validation to assess their capacity to predict OA progression and treatment outcomes.^25,35,39 The emergence of no-code AI platforms (e.g., Lee et al.³⁵) can democratize model development for clinicians without programming expertise, facilitating translation into real-world healthcare workflows.

Finally, successful translation will require attention to ethical deployment and human–AI interaction. AI systems should be positioned as decision-support tools rather than replacements for expert judgment, with continuous post-deployment auditing to ensure consistent performance across patient subgroups. Addressing these considerations will be critical to realizing clinically trustworthy, scalable, and equitable AI-assisted KL grading systems.^36,39,44

Ultimately, the future of AI-assisted KL grading lies in clinically explainable, ethically grounded, and technically robust systems validated across diverse populations and imaging standards. Such advances will enable precision stratification of osteoarthritis severity, streamline radiographic interpretation, and foster equitable access to AI-driven musculoskeletal care worldwide.

Conclusion

AI has redefined the landscape of radiographic KOA assessment by automating the KL grading system with remarkable accuracy and reproducibility. The reviewed evidence demonstrates that CNN-based and ensemble DL models can perform KL grading with diagnostic metrics equivalent to expert readers, often achieving accuracies above 85% and AUC values nearing 0.98. Methodological innovations, such as ordinal regression, multi-task learning, and ensemble architectures, have refined model interpretability and minimized misclassification across disease stages.

Yet, despite these advances, real-world implementation remains constrained by the subjectivity of KL labels, dataset imbalance, and limited cross-population validation. The overreliance on Western-centric datasets such as OAI and MOST restricts demographic representation and model fairness. Furthermore, the “black-box” nature of deep models continues to challenge interpretability and clinician trust.

Future progress demands a shift toward standardized, explainable, and ethically sound AI systems validated across diverse, multicenter cohorts. The incorporation of multicenter data, federated learning for privacy-preserving collaboration, and human–AI collaborative frameworks will bridge the gap between algorithmic precision and clinical reality. As accessibility increases through no-code platforms and open-source validation frameworks, AI-assisted KL grading stands poised to revolutionize musculoskeletal diagnostics, delivering precision, equity, and efficiency in KOA care worldwide.

Footnotes

Acknowledgements

None.

Declarations

ORCID iDs

Saumya Rawat

Binit Vaidya

Hemalatha Shanmugam

Lavanya Airen

Trial registration number/date

Not applicable.

Grant number

Not applicable.

References

1. Long, Liu, Yin, et al. Prevalence trends of site-specific osteoarthritis from 1990 to 2019: findings from the global burden of disease study 2019. Arthritis Rheumatol 2022; 74(7): 1172–1183.

2. Langworthy, Dasa, Spitzer AI. Knee osteoarthritis: disease burden, available treatments, and emerging options. Ther Adv Musculoskelet Dis 2024; 16: 1759720X241273009.

3. Vaattovaara, Panfilov, Tiulpin, et al. Kellgren–Lawrence grading of knee osteoarthritis using deep learning: diagnostic performance with external dataset and comparison with four readers. Osteoarthr Cartil Open 2025; 7(2): 100580.

4. Dell’Isola, Recenti, Giardulli, et al. Osteoarthritis year in review 2025: epidemiology and therapy. Osteoarthritis Cartilage 2025; 33(11): 1300–1306.

5. Tan, et al. Global burden and socioeconomic impact of knee osteoarthritis: a comprehensive analysis. Front Med (Lausanne) 2024; 11: 1323091.

6. GBD 2021 Osteoarthritis Collaborators. Global, regional, and national burden of osteoarthritis, 1990–2020 and projections to 2050: a systematic analysis for the Global Burden of Disease Study 2021. Lancet Rheumatol 2023; 5(9): e508–e522.

7. Weng, Chen, Jiang, et al. Global burden of early-onset osteoarthritis, 1990–2019: results from the Global Burden of Disease Study 2019. Ann Rheum Dis 2024; 83(7): 915–925.

8. Kohn, Sassoon, Fernando ND. Classifications in brief: Kellgren–Lawrence classification of osteoarthritis. Clin Orthop Relat Res 2016; 474(8): 1886–1893.

9. Yong, Teo, Murphy, et al. Knee osteoarthritis severity classification with ordinal regression module. Multim Tools Appl 2022; 81(29): 41497–41509.

10. Fukui, Yamane, Ishida, et al. Relationship between radiographic changes and symptoms or physical examination findings in subjects with symptomatic medial knee osteoarthritis: a three-year prospective study. BMC Musculoskelet Disord 2010; 11: 269.

11. Hayes, Kittelson, Loyd, et al. Assessing radiographic knee osteoarthritis: an online training tutorial for the Kellgren–Lawrence Grading Scale. MedEdPORTAL 2016; 12: 10503.

12. Kellgren, Lawrence JS. Radiological assessment of osteo-arthrosis. Ann Rheum Dis 1957; 16(4): 494–502.

13. Choi, Kim, et al. Association of radiographic structure deformity phenotypes of knee OA to clinical symptoms and risk for progression: proposing a modification of Kellgren–Lawrence grade—data from the Osteoarthritis Initiative and the MOST study. Osteoarthr Cartil Open 2025; 7: 100566.

14. Mohammadi, Salehi, Jahanshahi, et al. Artificial intelligence in osteoarthritis detection: A systematic review and meta-analysis. Osteoarthritis Cartilage 2024; 32(3): 241–253.

15. Kellgren, Lawrence JS. Rheumatism in miners. II. X-ray study. Br J Ind Med 1952; 9: 197–207.

16. Swiecicki, O’Donnell, et al. Deep learning-based algorithm for assessment of knee osteoarthritis severity in radiographs matches performance of radiologists. Comput Biol Med 2021; 133: 104334.

17. Wright; MARS Group. Osteoarthritis classification scales: interobserver reliability and arthroscopic correlation. J Bone Joint Surg Am 2014; 96: 1145–1151.

18. Felson, Naimark, Anderson, et al. The prevalence of knee osteoarthritis in the elderly. Arthritis Rheum 1987; 30: 914–918.

19. Damen, Schiphof, Wolde, et al. Inter-observer reliability for radiographic assessment of early osteoarthritis features: the CHECK (cohort hip and cohort knee) study. Osteoarthritis Cartilage 2014; 22(7): 969–974.

20. Scott, Lethbridge-Cejku, Reichle, et al. Reliability of grading scales for individual radiographic features of osteoarthritis of the knee. Invest Radiol 1993; 28: 497–501.

21. Gossec, Jordan, Mazzuca, et al.; OARSI-OMERACT Task Force “Total Articular Replacement as Outcome Measure in OA.” Comparative evaluation of three semi-quantitative radiographic grading techniques for knee osteoarthritis in terms of validity and reproducibility in 1759 X-rays: report of the OARSI-OMERACT task force. Osteoarthritis Cartilage 2008; 16: 742–748.

22. Jordan, Luta, Stabler, et al. Ethnic and sex differences in serum levels of cartilage oligomeric matrix protein: the Johnston County Osteoarthritis Project. Arthritis Rheum 2003; 48(3): 675–681.

23.

Hart

Spector

Brown

, et al. Clinical signs of early osteoarthritis: reproducibility and relation to x ray changes in 541 women in the general population. Ann Rheum Dis 1991; 50(7): 467–470.

24.

Williams

Farrell

Cunningham

, et al. Knee pain and radiographic osteoarthritis interact in the prediction of levels of self-reported disability. Arthritis Rheum 2004; 51(4): 558–561.

25.

Olsson

Akbarian

Lind

, et al. Automating classification of osteoarthritis according to Kellgren–Lawrence in the knee using deep learning in an unfiltered adult population. BMC Musculoskelet Disord 2021; 22(1): 844.

26.

Thomas

Kidziński

Halilaj

, et al. Automated classification of radiographic knee osteoarthritis severity using deep neural networks. Radiol Artif Intell 2020; 2(2): e190065.

27.

Cueva

Castillo

Espinós-Morató

, et al. Detection and classification of knee osteoarthritis. Diagnostics (Basel) 2022; 12(10): 2362.

28.

Lee

, et al. Ensemble deep-learning networks for automated osteoarthritis grading in knee X-ray images. Sci Rep 2023; 13(1): 22887.

29.

Yayli

Kılıç

Beyaz

Deep learning in gonarthrosis classification: a comparative study of model architectures and single vs multi-model methods. Front Artif Intell 2025; 8: 1413820.

30.

Tariq

Suhail

Nawaz

Knee osteoarthritis detection and classification using x-rays. IEEE Access 2023; 11: 48292–48303.

31.

Mohammed

Hasanaath

Latif

, et al. Knee osteoarthritis detection and severity classification using residual neural networks on preprocessed X-ray images. Diagnostics 2023; 13(8): 1380.

32.

Kondal

Kulkarni

Kharat

, et al. Automatic grading of knee osteoarthritis on the Kellgren–Lawrence Scale from radiographs using convolutional neural networks. ArXiv, abs/20048572, 2020.

33.

Lee

Song

Han

, et al. Accurate, automated classification of radiographic knee osteoarthritis severity using a novel method of deep learning: plug-in modules. Knee Surg Relat Res 2024; 36(1): 24. (Erratum in: Knee Surg Relat Res 2025; 37(1): 17.)

34.

Pongsakonpruttikul

Angthong

Kittichai

, et al. Artificial intelligence assistance in radiographic detection and classification of knee osteoarthritis and its severity: a cross-sectional diagnostic study. Eur Rev Med Pharmacol Sci 2022; 26(5): 1549–1558.

35.

Lee

Yun

, et al. Automated diagnosis of knee osteoarthritis using ResNet101 on a DEEP:PHI: leveraging a no-code AI platform for efficient and accurate medical image analysis. Diagnostics (Basel) 2024; 14(21): 2451.

36.

Brejnebøl

Hansen

Bachmann

, et al. External validation of an artificial intelligence tool for radiographic knee osteoarthritis severity classification. Eur J Radiol 2022; 150: 110249.

37.

Tiulpin

Saarakkala

Automatic grading of individual knee osteoarthritis features in plain radiographs using deep convolutional neural networks. Diagnostics (Basel) 2020; 10(11): 932.

38.

Yoon

Yon

Lee

, et al. Assessment of a novel deep learning-based software developed for automatic feature extraction and grading of radiographic knee osteoarthritis. BMC Musculoskelet Disord 2023; 24(1): 869.

39.

Wang

Huang

Zhu

, et al. Successful real-world application of an osteoarthritis classification deep-learning model using 9210 knees—an orthopedic surgeon’s view. J Orthop Res 2023; 41: 737–746.

40.

Tiulpin

Thevenot

Rahtu

, et al. Automatic knee osteoarthritis diagnosis from plain radiographs: a deep learning-based approach. Sci rep 2018; 8(1): 1727.

41.

Zhao

Zhang

, et al. The value of deep learning-based X-ray techniques in detecting and classifying K-L grades of knee osteoarthritis: a systematic review and meta-analysis. Eur Radiol 2025; 35(1): 327–340.

42.

Samee

El-Kenawy

Atteia

, et al. Metaheuristic optimization through deep learning classification of COVID-19 in chest X-ray images. Comput Mater Contin 2022; 73(2): 4193–4210.

43.

Alelyani

A validated framework for responsible AI in healthcare autonomous systems. Sci Rep 2025; 15: 44432.

44.

Liu

Cruz Rivera

Moher

, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med 2020; 26: 1364–1374.