Automated Objective Scoring of Osteoarthritis Severity in Mouse Medial Tibial Cartilage Using Deep Learning

Abstract

Objective

This study developed and optimized a deep learning model to automate OARSI-based histologic scoring of mouse medial tibial cartilage.

Design

Safranin-O-stained cartilage images were obtained from mice with OA induced by medial meniscus instability. A total of 2,788 images from 1,000 knees of 520 mice were included for model development and evaluation. Each data set was evaluated using deep learning models with multiple selection criteria for the reference standard. For preprocessing, horizontal cartilage alignment was learned using a VGG16-based regression model, and the tibial cartilage region was detected using YOLO-v7. These learned weights were applied to generate a rotation- and crop-adjusted cartilage data set, which was used to evaluate three CNNs.

Results

Initial classification of 544 mouse cartilage images showed low accuracy, leading to an expansion of the data set to 2,788 images. An algorithm was applied to align the images horizontally and crop only the joint region, thereby reducing misclassification of noncartilaginous regions. This approach significantly improved the accuracy of cartilage degradation scoring. Among the deep learning models evaluated, VGG16 showed the best performance, achieving an MAE of 0.33. The model also recorded a precision of 0.680 (95% CI: 0.668–0.693), recall of 0.645 (95% CI: 0.627–0.664), F1-score of 0.653 (95% CI: 0.636–0.671), and accuracy of 0.648 (95% CI: 0.631–0.665).

Conclusion

In this study, the VGG16 model showed high concordance with expert assessments, suggesting the feasibility of automating OA grading from histological images in large-scale animal studies.

Keywords

osteoarthritis, articular cartilage, histology, basic cartilage and bone imaging research using artificial intelligence

Introduction

Osteoarthritis (OA) is a chronic degenerative joint disease characterized by the progressive breakdown of articular cartilage, leading to joint dysfunction, pain, and reduced mobility. Despite its high prevalence, the precise etiology of OA remains unclear, and effective treatments are still under development.^1,2 Due to the difficulty of obtaining cartilage tissue from patients, much research relies on animal models, particularly joint tissues or cartilage cells, to investigate potential therapies. Mouse models are widely used in arthritis research, including studies of both OA and rheumatoid arthritis (RA).^3-5 Current imaging modalities lack sufficient resolution or soft tissue contrast to assess cartilage in mouse models directly. Consequently, cartilage damage is typically evaluated using histological sections. Although histological analysis provides detailed visualization of joint tissues and their components, it is time-consuming, can damage samples, is limited to two dimensions, and is subject to observer interpretation.⁶

The Osteoarthritis Research Society International (OARSI) and Mankin scoring systems are the primary methods for assessing OA severity in animal models. Both evaluate cartilage degradation but differ in methodology. The OARSI system was developed to address limitations of the Mankin score, which is less reliable for early-stage OA due to its complexity. The OARSI score is widely preferred for its simplicity, proven effectiveness, broad acceptance, and its ability to yield valuable insights into OA pathogenesis and therapeutic interventions.^7-10 The OARSI system uses Safranin-O staining to assess structural cartilage changes, such as loss of staining, fissures, and surface erosion. Initially introduced for human OA in 2006, it was adapted in 2010 for standardized evaluation of OA lesions and histological changes in mice and rabbits with surgically induced OA.^11,12 Despite its advantages, the OARSI score still depends on manual assessment, which is inherently subjective and influenced by the scorer’s expertise. This semi-quantitative approach can lead to interrater variability, requiring extensive scorer training and sometimes blind testing within the same institution. Moreover, validation often requires multiple tissue samples and comprehensive analyses, resulting in considerable time and cost burdens.

Deep learning has emerged as a promising solution for automating complex image-based evaluations.^13,14 Deep learning algorithms, such as convolutional neural networks (CNNs), have enabled automatic extraction and analysis of information from histological images. Recent advances have demonstrated that two-step deep learning approaches can significantly improve diagnostic accuracy by detecting and classifying specific image regions. Several research groups have developed deep learning methods to evaluate joint damage in patients and animal models. For example, deep learning models have been used to assess the severity of knee OA from radiographs and to improve joint damage assessment by enabling more accurate bone segmentation and 3D modeling.¹⁵ Attention-based CNN models have also been implemented to automatically predict joint damage in patients with RA using radiographs.¹⁶ In addition, deep learning techniques have been used to automate histological grading of engineered cartilage tissues.¹⁷ However, despite these advancements, deep learning methods for fully automating OARSI scoring are not yet available.

Therefore, this study aimed to automate the classification of OA severity in animal models using deep learning. We hypothesized that training CNN models with histological images of OA in animal models could overcome the limitations of manual scoring and enable objective, reproducible classification. To achieve this, we applied three representative CNN models, namely visual geometry group (VGG16),¹⁸ residual network (ResNet50),¹⁹ and extreme inception (Xception),²⁰ each trained independently on each data set. This approach aimed to minimize errors inherent to manual evaluation and provide consistent and objective assessments.

Material and Methods

Study Design

This study was a retrospective analysis performed to automate the classification of OA severity in mouse cartilage using Safranin-O-stained histological images. Preprocessed images were input into a VGG16 CNN model, trained against consensus scores as the reference standard. The VGG16 model extracted image features and classified OARSI grades through fully connected layers. Model performance was evaluated using standard performance metrics, and the overall study design is illustrated in Figure 1A.

Figure 1.

Study design. (A) The proposed approach employs deep learning techniques, specifically the VGG16 model, to assess OA severity from preprocessed images. Image workflow and data set composition for model performance evaluation. (B) A total of 3,000 Safranin-O-stained cartilage images were obtained from 1,000 knees of 520 mice and divided by mouse ID into training (2,100 images, 364 mice), validation (450 images, 78 mice), and test (450 images, 78 mice) sets. After excluding 212 low-quality images, 2,788 images were used for analysis. Grading was performed based on consensus among three expert raters. The large data set was subdivided into multiple experimental subsets, and the corresponding training, validation, and testing split ratios are indicated. (C) The small and large data sets were split into training (70%), validation (15%), and testing (15%) sets. (D) The rotation data set was used entirely (100%) for training, while the crop data set was divided into training (80%) and validation (20%). The cartilage-rotation-crop data set generated through the preprocessing pipeline was split into training (70%), validation (15%), and testing (15%) sets for model evaluation.

Image Acquisition

All cartilage images used in this study were obtained from previous research conducted at Chonnam National University and Gwangju Institute of Science and Technology (GIST).^9,10,21-26 A total of 1,000 leg specimens were collected from 520 mice (bilateral knees from 480 mice and unilateral knees from 40 mice), and Safranin-O-stained cartilage images were acquired from each leg. The knee joint cartilage was divided into three regions (anterior, middle, and posterior), and ten 5-µm sections were prepared for each region. One representative slide was selected from each region, resulting in three cartilage images per leg. These images represent distinct anatomical areas, each of which can exhibit varying degrees of cartilage damage. Therefore, assigning the same grade to all images from the same animal was not feasible. A total of 3,000 images were initially collected and divided into training (70%), validation (15%), and test (15%) sets based on mouse ID. As a result, the training set included 364 mice (2,100 images), while the validation and test sets included 78 mice each (450 images per set). Subsequently, images unsuitable for analysis were excluded in proportion to the initial data split ratio: 212 images were removed based on predefined quality criteria (e.g., motion/defocus blur, section tearing or folding, staining failure/uneven staining, missing or severely truncated cartilage, and other artifacts that precluded reliable grading). Through this process, we ensured that images derived from the same mouse were not included across different data sets (i.e., training, validation, or test), thereby maintaining strict separation by mouse identity. After this filtering step, a total of 2,788 cartilage images were included in the final analysis (Fig. 1B). We adhered to the ARRIVE guidelines and have included the ARRIVE checklist in the Supplementary Material.
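The mouse-level split described above can be illustrated with a minimal sketch. The function name `split_by_mouse`, the image and mouse identifiers, and the random seed are hypothetical and are not taken from the study's code; the sketch only shows the principle of assigning whole animals, rather than individual images, to one subset.

```python
import random
from collections import defaultdict


def split_by_mouse(image_to_mouse, ratios=(0.70, 0.15, 0.15), seed=0):
    """Assign images to train/val/test so that no mouse spans two subsets.

    image_to_mouse maps an image ID to the ID of the mouse it came from.
    The split is drawn over mice, and every image follows its mouse.
    """
    mice = sorted(set(image_to_mouse.values()))
    random.Random(seed).shuffle(mice)

    n_train = round(ratios[0] * len(mice))
    n_val = round(ratios[1] * len(mice))

    subset_of = {}
    for position, mouse in enumerate(mice):
        if position < n_train:
            subset_of[mouse] = "train"
        elif position < n_train + n_val:
            subset_of[mouse] = "val"
        else:
            subset_of[mouse] = "test"

    splits = defaultdict(list)
    for image, mouse in sorted(image_to_mouse.items()):
        splits[subset_of[mouse]].append(image)
    return dict(splits)
```

With 520 mice and the 70/15/15 ratios, this yields 364, 78, and 78 mice per subset, matching the counts reported above; the per-subset image counts depend on which animals contributed one or two knees.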

Data Set Composition

The small data set consisted of 544 images randomly selected from the total of 2,788 images and was divided into 388 training, 78 validation, and 78 test images. It served as a pilot study to evaluate feasibility and identify failure modes. The large data set included all 2,788 images, split into 1,952 training, 418 validation, and 418 test images, and was used for all primary experiments and final reporting. The training set was used to learn image patterns, the validation set to prevent overfitting, and the test set to evaluate model performance. Three of the four experts who participated in the interrater variability assessment assigned OARSI scores to all 2,788 images, and the reference standard was determined by majority voting. In cases without a majority agreement, the final score was determined through consensus among the three raters (Fig. 1C).
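As an illustration of the reference-standard rule, the following sketch returns the majority grade when at least two of the three raters agree and otherwise flags the image for a consensus discussion. The function name `reference_grade` and the returned flag are assumptions made for this example.

```python
from collections import Counter


def reference_grade(scores):
    """Return (grade, needs_consensus) for one image scored by three raters.

    If at least two raters agree, their grade is the reference standard.
    Otherwise no grade is returned and the image is flagged so that the
    raters can settle the score by discussion.
    """
    grade, votes = Counter(scores).most_common(1)[0]
    if votes >= 2:
        return grade, False
    return None, True


# Two of three raters chose grade 1, so grade 1 is the reference.
print(reference_grade([1, 1, 2]))
# All three raters disagree, so the image is flagged for consensus.
print(reference_grade([0.5, 1, 2]))
```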

Image Preprocessing

Image preprocessing comprised sequential rotation and cropping to standardize orientation and focus on tibial cartilage. This explicit geometric preprocessing was adopted to prioritize preservation of native histological texture and avoid unnecessary pixel-level transformations. Rotation standardized image orientation, and subsequent region of interest (ROI) cropping reduced irrelevant background while retaining grading-relevant tibial cartilage morphology. First, to address variations in image orientation, 397 images were selected from the total of 2,788 images based on their wide range of inclination angles. For each selected image, the inclination angle of the cartilage was calculated by connecting its two endpoints and measuring the angle relative to the horizontal axis (allowing both positive and negative values). Each image was then rotated to align the cartilage horizontally. These rotated images were used to train a VGG16-based regression model to predict the inclination angle of cartilage in unprocessed images. The learned weights were subsequently applied to the entire data set to predict the inclination angle relative to the horizontal axis, and each cartilage image was rotated accordingly to produce a cartilage-rotation data set. The mean absolute error (MAE) was used as the loss function for the regression model predicting the cartilage angle. Next, an object detection data set was created by labeling bounding boxes on 337 rotated images to crop only the tibial cartilage region. Of these, 273 images were used for training and 64 for validation. Object detection was performed using the deep learning model you only look once (YOLO-v7). YOLO can detect multiple objects within a single image by dividing the input into a grid and directly predicting bounding boxes and class probabilities for each grid cell, achieving high detection accuracy with rapid inference.²⁷ This two-stage preprocessing pipeline (rotation followed by cropping) produced a final cartilage-rotation-crop data set containing only the tibial cartilage regions, and the entire process was fully automated (Fig. 1D). To ensure the reliability of this automated process, the robustness of the cropping pipeline was validated on an independent test set (n = 64), achieving a mean Average Precision (mAP) of 0.765 (Suppl. Table S1). To ensure the independence of the evaluation, all images used in the two-stage preprocessing pipeline, including the 397 images for the rotation model and the 337 images for the YOLO-v7 model, were sampled exclusively from the training split (70% of the total 2,788 images). This strict isolation guaranteed that the validation (15%) and test (15%) sets remained entirely unseen during the development of the preprocessing pipeline, thereby preventing data leakage. This modular two-step approach separated alignment from ROI extraction, standardizing inputs for detection and enabling stage-wise error analysis with minimal pixel-level transformation.
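The angle labeling and the MAE loss described above can be sketched as follows. The function names, the coordinate convention (pixel coordinates given as (x, y) pairs), and the example endpoints are assumptions for illustration; the sign of the angle depends on whether the y-axis points up or down in the image library used.

```python
import math


def inclination_deg(left_end, right_end):
    """Signed angle (degrees) of the line joining two cartilage endpoints,
    measured relative to the horizontal axis."""
    (x1, y1), (x2, y2) = left_end, right_end
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def rotate_point(point, center, angle_deg):
    """Rotate a point about a center by angle_deg (standard rotation matrix)."""
    a = math.radians(angle_deg)
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (center[0] + dx * math.cos(a) - dy * math.sin(a),
            center[1] + dx * math.sin(a) + dy * math.cos(a))


def mean_absolute_error(true_angles, predicted_angles):
    """MAE between labeled and predicted angles, as used for the regression loss."""
    pairs = list(zip(true_angles, predicted_angles))
    return sum(abs(t - p) for t, p in pairs) / len(pairs)


# Hypothetical endpoints: rotating by the negative angle levels the cartilage line.
left, right = (100.0, 220.0), (400.0, 160.0)
angle = inclination_deg(left, right)
center = (250.0, 190.0)
leveled_left = rotate_point(left, center, -angle)
leveled_right = rotate_point(right, center, -angle)
```

After the rotation, both endpoints share the same y-coordinate, which is the sense in which the cartilage is aligned horizontally.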

Deep Learning Model

To classify the prepared data sets, we employed three well-established CNN models (VGG16, ResNet50, and Xception) because of their proven performance in image classification and regression tasks and their ability to handle complex visual patterns. The VGG16 model features a simple and uniform architecture consisting of 16 weight layers and small convolutional filters (3 × 3) that enable the capture of fine-grained features. The ResNet50 model addresses the vanishing gradient problem using a 50-layer CNN based on residual learning. The Xception model, based on the Inception architecture, leverages extreme separable convolutions for greater efficiency. For the small data set and the tibia-crop data set, training was performed using the VGG16 model only. For the large data set and the cartilage-rotation-crop data set, we trained the VGG16, ResNet50, and Xception models and compared their performance. Data augmentation techniques including zooming in/out, brightness adjustment, and horizontal flipping were applied depending on the data set. Classification was performed by selecting the class with the highest predicted probability from the final softmax layer of each model. Stochastic gradient descent (SGD) was used as the optimizer (learning rate = 0.001, momentum = 0.9), and categorical cross-entropy was applied as the loss function. To prevent overfitting, we added a dropout layer (rate = 0.5) after the fully connected layer of each model. Because each model had a different learning rate and convergence behavior, training was stopped when the validation performance or loss no longer improved, resulting in 30 to 100 training epochs.
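The classification rule, the loss, and the stopping criterion described above can be sketched in plain Python. The grade list follows the eight grades reported in Table 1; the `patience` parameter and all function names are hypothetical, since the text does not state how many non-improving epochs were tolerated before training was stopped.

```python
import math

# The eight OARSI grades listed in Table 1, in class-index order.
GRADES = [0, 0.5, 1, 2, 3, 4, 5, 6]


def softmax(logits):
    """Convert raw network outputs into class probabilities."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, true_index):
    """Loss for one image: negative log-probability of the true class."""
    return -math.log(probs[true_index])


def predict_grade(probs):
    """Select the grade with the highest predicted probability."""
    best_index = max(range(len(probs)), key=lambda c: probs[c])
    return GRADES[best_index]


def stopping_epoch(val_losses, patience=5):
    """Epoch (1-based) at which training stops once the validation loss has
    not improved for `patience` consecutive epochs (patience is assumed)."""
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses)
```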

Analysis of Learning/Loss Curves and Confusion Matrices

Learning and loss curves were used to evaluate model training and generalization performance, with the X-axis representing epochs and the Y-axis showing loss (left) and accuracy (right). These curves allowed us to assess convergence and to detect potential overfitting or underfitting. To analyze classification performance, a confusion matrix was generated to visualize the Top-N (Top-1 and Top-2) predictions on the test sets for both the small and large data sets. Diagonal elements indicate correct classifications, whereas off-diagonal elements represent misclassifications.

Performance Evaluation

The performance of each model was evaluated by precision (positive predictive value), recall (sensitivity), F1-score, and accuracy, as follows:

Precision = TP/(TP + FP)

Recall = TP/(TP + FN)

F1-score = 2 × (Precision × Recall)/(Precision + Recall)

Accuracy = (TP + TN)/(TP + TN + FP + FN)

Additionally, to complement performance assessment in the multi-class classification setting, Top-1 accuracy (where the top prediction matches the true label) and Top-2 accuracy (where one of the top two predictions matches the true label) were calculated. A multi-class evaluation approach was used because the classification involved six or eight classes, and all calculations were performed using Python’s sklearn.metrics module.²⁸ For the final preprocessed data set, 95% confidence intervals (CIs) were estimated to provide a measure of statistical uncertainty.
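To make the multi-class use of these formulas concrete, the sketch below derives per-class precision, recall, and F1-score from a confusion matrix, then macro-averages them. The study used `sklearn.metrics`; this plain-Python version is only an illustration. It assumes that overall accuracy is the fraction of correctly classified images, and the function name and the small example matrix are hypothetical.

```python
def per_class_metrics(cm):
    """Per-class and macro-averaged metrics from a confusion matrix.

    cm[i][j] is the number of images whose true class is i and whose
    predicted class is j. Each class is treated as positive in turn.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    rows = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, truly another class
        fn = sum(cm[k]) - tp                       # truly k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        rows.append({"precision": precision, "recall": recall, "f1": f1})
    macro = {key: sum(r[key] for r in rows) / n
             for key in ("precision", "recall", "f1")}
    accuracy = sum(cm[k][k] for k in range(n)) / total
    return rows, macro, accuracy


# Hypothetical three-class example.
example_cm = [[5, 1, 0],
              [2, 3, 1],
              [0, 1, 7]]
rows, macro, accuracy = per_class_metrics(example_cm)
```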

Receiver Operating Characteristic (ROC) Curve and Precision-Recall (PR) Curve Analysis

To evaluate the performance of the deep learning models, we utilized ROC curves and area under the curve (AUC), as well as PR curves and average precision (AP) metrics. The ROC curve was used to assess the trade-off between sensitivity and specificity, while the PR curve provided insight into precision and recall, particularly for imbalanced data. To evaluate the classification performance across the eight grades, we employed a one-vs-rest (OvR) strategy for both ROC and PR curve analyses. Overall model performance was summarized using macro-averaged AUC and AP to give equal weight to each grade. The macro-averaged no-skill baselines are indicated by red dashed lines in Figure 5D-F. Detailed class-specific results are provided in Suppl. Table S2.
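The one-vs-rest macro-averaging can be sketched as follows. Here the binary AUC is computed through its rank interpretation (the probability that a randomly chosen positive receives a higher score than a randomly chosen negative), which is equivalent to the area under the ROC curve; the function names and the tiny example are assumptions for illustration and do not reproduce the study's implementation.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(y_true, prob_rows, n_classes):
    """One-vs-rest AUC per class and their unweighted (macro) mean.

    y_true holds class indices; prob_rows holds one probability vector
    per image, with one entry per class.
    """
    aucs = []
    for k in range(n_classes):
        labels = [1 if y == k else 0 for y in y_true]  # class k versus the rest
        scores = [row[k] for row in prob_rows]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / n_classes, aucs
```

Macro-averaging gives each grade the same weight regardless of how many images it contains, which matters here because the highest grade has far fewer test images than the lower grades.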

Visualization of Model Decisions With Gradient-Weighted Class Activation Mapping (Grad-CAM)

To visually explain which regions the model focused on when making predictions, we applied Grad-CAM.²⁹ The method generates coarse localization maps by visualizing the gradients flowing into the final convolutional layer, thereby highlighting image regions that contribute most to the predicted class.
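The core Grad-CAM computation can be sketched without a deep learning framework, assuming the feature maps of the final convolutional layer and the gradients of the class score with respect to them have already been extracted. The function name and the nested-list layout ([channel][row][column]) are assumptions for this example.

```python
def grad_cam(activations, gradients):
    """Coarse localization map from final-layer feature maps and gradients.

    Each channel weight is the spatial mean of that channel's gradients.
    The map is the ReLU of the weighted sum of feature maps, scaled to
    the range [0, 1] when it has any positive value.
    """
    n_rows, n_cols = len(activations[0]), len(activations[0][0])
    heatmap = [[0.0] * n_cols for _ in range(n_rows)]

    for feature_map, grad_map in zip(activations, gradients):
        weight = sum(sum(row) for row in grad_map) / (n_rows * n_cols)
        for i in range(n_rows):
            for j in range(n_cols):
                heatmap[i][j] += weight * feature_map[i][j]

    # ReLU keeps only regions that push the class score upward.
    heatmap = [[max(v, 0.0) for v in row] for row in heatmap]
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[v / peak for v in row] for row in heatmap]
    return heatmap
```

In practice the resulting low-resolution map is upsampled to the input image size and overlaid on the histological image.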

Robustness Assessment of the Model on Cropped Images

From the original data set of 2,788 images, 10 images were randomly selected from each grade. To assess robustness, each image was modified to include variations in the angle and size of cartilage regions and processed using the trained VGG16 model. For each image, nine versions were generated: the original image and eight differently cropped variants (crop 1-8). This resulted in 90 images per grade. The consistency of the VGG16 model’s predictions across these transformed images was then quantified.

Statistical Analysis

To quantify interrater variability in OARSI scoring, Fleiss’ Kappa values were calculated to measure agreement among multiple expert raters. To assess the reliability of both model performance metrics and rater assessments, 95% CIs were computed, providing a statistical range within which the true values are likely to fall. All statistical analyses were performed using Python’s statsmodels and sklearn.metrics modules.
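For reference, the overall Fleiss' Kappa statistic can be sketched in plain Python as below. The study used `statsmodels`; this version is illustrative only, covers the overall statistic rather than the per-grade values reported in the Results, and uses a hypothetical function name and input layout (one list of rater scores per image, with the same number of raters for every image).

```python
def fleiss_kappa(ratings, categories):
    """Overall Fleiss' Kappa for several raters scoring the same images.

    ratings: one list per image holding the grade given by each rater.
    categories: every grade that could be assigned.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    counts = [[row.count(c) for c in categories] for row in ratings]

    # Observed agreement: mean pairwise agreement among raters per image.
    per_subject = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_subject) / n_subjects

    # Chance agreement from the overall share of each grade.
    shares = [sum(row[j] for row in counts) / (n_subjects * n_raters)
              for j in range(len(categories))]
    p_chance = sum(s * s for s in shares)

    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one grade is ever used")
    return (p_observed - p_chance) / (1.0 - p_chance)
```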

Model Availability

The final model developed in this study, which was evaluated on the test data, is available online at https://github.com/esfman-git/Osteoarthritis_grading

Results

Interrater Variability in OARSI Cartilage Grading

To evaluate the interrater reliability of the OARSI grading system in assessing cartilage degradation, four independent investigators graded the same Safranin-O-stained tissue samples. Despite the use of a standardized scoring system, differences in observation points and subjective interpretation resulted in inconsistencies among raters (Fig. 2A and B; Suppl. Fig. S1). To quantify this variability, Fleiss’ Kappa analysis was performed across all OARSI grades. The analysis revealed substantial disagreement, particularly in the lower grades (0.5–4). For instance, κ values were 0.315 (95% CI: 0.233–0.434) for grade 0.5, 0.245 (95% CI: 0.163–0.327) for grade 1, and only 0.154 (95% CI: 0.090–0.218) for grade 3. In contrast, higher agreement was observed at the extremes, with κ = 0.595 (95% CI: 0.525–0.665) for grade 0 and κ = 0.815 (95% CI: 0.717–0.913) for grade 6 (Fig. 2C). Based on these observations, we next applied a deep learning approach to assess whether automated scoring could address the inconsistencies seen in manual grading.

Figure 2.

Limitations of OA severity assessment using the OARSI grading system. (A) Different scores assigned to the same sample by four expert raters (P1-P4) based on their subjective evaluations of cartilage loss severity. (B) Discrepancies in severity ratings among the four expert raters for representative images classified by grade. (C) Quantification of the kappa statistic indicates the degree of agreement in assessment results among raters, with higher values indicating greater consistency. The results are visualized as a heat map.

Application of Deep Learning Using Small Data Set

To mitigate the subjectivity associated with the OARSI scoring system, a VGG16-based classification model was trained on a small data set. Following training, the model’s learned weights were applied to a separate test set for prediction. The training curves indicated that the validation loss did not converge to 0, and the validation accuracy did not reach 1 (Fig. 3A). According to the confusion matrix, only 12 out of 24 images in class 0 were correctly classified, with similar patterns of misclassification observed across other classes (Fig. 3B). The overall classification accuracy on the small data set was 0.526 (Fig. 3B; Suppl. Table S3).

Figure 3.

Comparison of prediction accuracy among three CNN models and expert raters on data sets of different sizes. (A) Training curve showing the performance of the VGG16 model trained on a small data set (X-axis: Number of Epochs; Left Y-axis: Loss; Right Y-axis: Accuracy). (B) Confusion matrix illustrating the performance of the VGG16 model in assessing OA severity on the small data set. Confusion matrices comparing the prediction performance of three CNN models (C) VGG16, (D) ResNet50, (E) Xception on the large data set, highlighting differences in classification accuracy and model behavior. (F) Grad-CAM visualizations providing visual explanations of the VGG16 model’s predictions.

Deep Learning Model Performance on a Large Data Set

To reduce misclassification errors potentially caused by insufficient training data, we next increased the number of images used for training by expanding the data set. Training curves for loss and accuracy revealed that VGG16 exhibited a stable learning pattern, while ResNet50 and Xception showed irregular validation loss and accuracy (Suppl. Fig. S2A-F). Among the three CNN models, VGG16 achieved the best performance, with a precision of 0.63, recall of 0.59, and an F1-score of 0.58. The corresponding classification accuracy was 0.59 (Fig. 3C-E; Suppl. Table S4). Despite the large training data set, only marginal improvements in classification accuracy were observed. In many misclassified cases, the model identified regions of interest that did not correspond to the articular cartilage (Fig. 3F).

Deep Learning Model Performance on a Tibia-Crop Data Set

A tibia-crop data set was generated by labeling and cropping only the tibial cartilage region using the YOLO-v7 model. The VGG16 model was trained on this data set; however, no significant improvement in classification performance was observed compared to the model trained on the large data set containing the full cartilage region. Specifically, the model achieved a precision of 0.54, recall of 0.57, F1-score of 0.53, and overall accuracy of 0.54, all of which were lower than those obtained with the large data set (Suppl. Fig. S3; Suppl. Table S5). Grad-CAM visualizations revealed that portions of the cartilage were excluded during the cropping process in some images, depending on the image angle (Suppl. Fig. S4).

Performance Evaluation of VGG16 Model Using Cartilage-Rotation-Crop Data Set

Cartilage images were first rotated to a horizontal orientation using a VGG16-based regression model, then cropped to include only the cartilage region using YOLO-v7-based object detection (Fig. 4A). The resulting cartilage-rotation-crop data set was used to evaluate the classification performance of VGG16, ResNet50, and Xception. Compared to the large data set, all three CNN models demonstrated improved classification performance, as evidenced by greater alignment along the diagonal of the confusion matrix (Fig. 4B-D). Precision, recall, and F1-score were also improved, with overall classification accuracy ranging from 0.584 to 0.648. Among the three CNN models, VGG16 achieved the most robust performance with an MAE of 0.33 and the highest accuracy at 0.648 (95% CI: 0.631–0.665), outperforming its performance on the tibia-crop data set (Table 1). Consistent with these findings, an ablation analysis further supported the rotation-to-cropping preprocessing order as the best-performing strategy (Suppl. Tables S6 and S7). Grad-CAM visualizations confirmed that the cartilage-rotation-crop data set effectively localized the cartilage region for severity classification (Fig. 4E). To further assess model performance, ROC and PR curve analyses were performed (Fig. 5). VGG16 achieved an AUC of 0.943 (95% CI: 0.933–0.953) and an AP of 0.712 (95% CI: 0.662–0.761) (Fig. 5A and D). ResNet50 achieved an AUC of 0.914 (95% CI: 0.897–0.931) and an AP of 0.656 (95% CI: 0.603–0.709) (Fig. 5B and E), while Xception achieved an AUC of 0.914 (95% CI: 0.897–0.930) and an AP of 0.619 (95% CI: 0.564–0.673) (Fig. 5C and F). To compare model performance with that of human raters, the false positive rate (FPR), true positive rate (TPR), precision, and recall were calculated for each grader based on consensus OARSI scores. These metrics were plotted as individual markers (Expert A: ○ green; Expert B: □ blue; Expert C: ∆ red) on the ROC (Fig. 5A-C) and PR (Fig. 5D-F) curves of each model. VGG16 exhibited prediction metrics that closely aligned with those of individual human raters. To further assess model robustness, performance was evaluated on a data set with differently cropped images. The VGG16 model showed comparable accuracy between the original data set (0.648, 95% CI: 0.631–0.665) and the cropped variants (0.636, 95% CI: 0.601–0.671), and the mean κ value of 0.760 (95% CI: 0.698–0.821) indicated substantial agreement, suggesting that minor variations in cartilage region cropping did not substantially affect classification outcomes (Fig. 6).

Figure 4.

A comparison of learning effects of three CNN models on cartilage-rotation-crop data set obtained through preprocessing of original images. (A) The preprocessing pipeline combining the VGG16-based regression model and YOLO-v7 to isolate the cartilage region from the original image, followed by rotation and cropping to normalize the input data for training. Confusion matrices showing the prediction performance of (B) the VGG16, (C) ResNet50, and (D) Xception models trained on the cartilage-rotation-crop data set, enabling direct comparison of their classification accuracy. (E) Grad-CAM visualization illustrating regions of interest used by the VGG16 model trained on the cartilage-rotation-crop data set.

Table 1.

Classification Report for the Cartilage-Rotation-Crop Data Set.

Model: VGG16
Set	Grade	Precision	Recall	F1	Support
Cartilage-rotation-crop data set	0	0.723 (0.682-0.765)	0.667 (0.607-0.726)	0.691 (0.670-0.712)	65
	0.5	0.629 (0.588-0.669)	0.676 (0.608-0.745)	0.648 (0.620-0.676)	88
	1	0.524 (0.441-0.607)	0.541 (0.404-0.679)	0.521 (0.448-0.593)	48
	2	0.616 (0.585-0.646)	0.635 (0.494-0.777)	0.617 (0.540-0.694)	58
	3	0.660 (0.629-0.663)	0.691 (0.671-0.712)	0.674 (0.649-0.699)	68
	4	0.650 (0.567-0.663)	0.550 (0.386-0.714)	0.572 (0.460-0.683)	42
	5	0.816 (0.732-0.900)	0.752 (0.676-0.829)	0.774 (0.760-0.789)	37
	6	0.858 (0.808-0.908)	0.650 (0.447-0.853)	0.727 (0.585-0.869)	12
	Mean	0.680 (0.668-0.693)	0.645 (0.627-0.664)	0.653 (0.636-0.670)	Σ=418
	Accuracy	0.648 (0.631-0.665)
	MAE	0.33
Model: ResNet50
Set	Grade	Precision	Recall	F1	Support
Cartilage-rotation-crop data set	0	0.643 (0.577-0.708)	0.698 (0.625-0.772)	0.666 (0.623-0.708)	65
	0.5	0.652 (0.605-0.699)	0.556 (0.504-0.607)	0.598 (0.553-0.643)	88
	1	0.463 (0.397-0.529)	0.458 (0.389-0.526)	0.459 (0.404-0.513)	48
	2	0.550 (0.493-0.607)	0.670 (0.609-0.730)	0.601 (0.569-0.633)	58
	3	0.633 (0.600-0.665)	0.598 (0.484-0.711)	0.608 (0.541-0.675)	68
	4	0.604 (0.550-0.659)	0.508 (0.460-0.556)	0.549 (0.519-0.578)	42
	5	0.722 (0.696-0.748)	0.795 (0.763-0.826)	0.756 (0.733-0.778)	37
	6	0.775 (0.670-0.879)	0.716 (0.574-0.858)	0.743 (0.619-0.867)	12
	Mean	0.630 (0.605-0.656)	0.625 (0.593-0.656)	0.622 (0.592-0.652)	Σ=418
	Accuracy	0.609 (0.587-0.631)			Σ=418
Model: Xception
Set	Grade	Precision	Recall	F1	Support
Cartilage-rotation-crop data set	0	0.624 (0.565-0.683)	0.704 (0.651-0.757)	0.662 (0.607-0.717)	65
	0.5	0.624 (0.583-0.666)	0.565 (0.533-0.597)	0.593 (0.559-0.627)	88
	1	0.441 (0.415-0.468)	0.442 (0.382-0.501)	0.441 (0.404-0.477)	48
	2	0.537 (0.415-0.468)	0.598 (0.519-0.677)	0.564 (0.523-0.604)	58
	3	0.605 (0.594-0.617)	0.627 (0.558-0.697)	0.614 (0.581-0.646)	68
	4	0.528 (0.454-0.602)	0.502 (0.431-0.574)	0.513 (0.454-0.572)	42
	5	0.711 (0.640-0.781)	0.583 (0.557-0.610)	0.641 (0.599-0.682)	37
	6	0.713 (0.647-0.780)	0.634 (0.523-0.745)	0.670 (0.580-0.760)	12
	Mean	0.598 (0.574-0.621)	0.582 (0.554-0.610)	0.587 (0.560-0.614)	Σ=418
	Accuracy	0.584 (0.567-0.608)			Σ=418

TP = True Positive; FP = False Positive; FN = False Negative; TN = True Negative.

Precision = TP/(TP + FP).

Recall = TP/(TP + FN).

F1-score = 2 × (Precision × Recall)/(Precision + Recall).

Accuracy = (TP + TN)/(TP + TN + FP + FN).

MAE = Mean Absolute Error.

The 95% confidence intervals are given in parentheses.

Figure 5.

ROC and PR curves for OA severity classification using three CNN models (VGG16, ResNet50, Xception). (A-C) ROC curves for each model, illustrating the trade-off between the TPR and FPR across OA severity grades. The corresponding AUC values indicate the classification performance for each grade. (D-F) PR curves for each model, showing the relationship between precision and recall for different OA severity grades. The AP values provide a quantitative measure of overall classification performance across all severity levels. The red dashed line represents the no-skill baseline (macro average). The performance of human raters is indicated by individual markers (Expert A: 〇 green; Expert B: ☐ blue; Expert C: ∆ red) on the ROC and PR curves.

Figure 6.

Robustness of the VGG16 model to variations in image cropping. Images graded as G1 were used to generate multiple cropped variants (Crop 1–8). The VGG16 model showed consistent classification accuracy between the cartilage-rotation-crop data set (Accuracy: 0.648, 95% CI: 0.631–0.665) and the cropped variants (Accuracy: 0.636, 95% CI: 0.601–0.671). The Kappa value of 0.760 (95% CI: 0.698–0.821) indicates substantial agreement, supporting the model’s robustness to minor variations in cartilage region cropping.

Addressing Grading Ambiguity With a Top-2 Prediction Approach

To address ambiguity near the boundaries between grading categories, a Top-2 prediction approach was employed. Using this scheme, the true values in the confusion matrices for all three CNN models were more concentrated along the diagonal compared to those obtained using the standard Top-1 prediction method (Suppl. Fig. S6). Precision, recall, and F1-score also improved under the Top-2 approach. Among the tested models, VGG16 achieved the highest performance, with a precision of 0.98, recall of 0.96, F1-score of 0.97, and an overall accuracy of 0.97 (Suppl. Table S8).
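The Top-N criterion can be written compactly: a prediction counts as correct when the true class is among the N classes with the highest predicted probability. The sketch below is illustrative, with a hypothetical function name and example probabilities.

```python
def top_k_accuracy(y_true, prob_rows, k=2):
    """Fraction of images whose true class is among the k most probable classes.

    y_true holds class indices; prob_rows holds one probability vector
    per image. With k=1 this reduces to ordinary (Top-1) accuracy.
    """
    hits = 0
    for true_index, probs in zip(y_true, prob_rows):
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        if true_index in ranked[:k]:
            hits += 1
    return hits / len(y_true)


# Hypothetical three-class example: the second image is a Top-1 miss
# but a Top-2 hit, because its true class is the runner-up.
y_true = [0, 1, 2]
probs = [[0.6, 0.3, 0.1],
         [0.5, 0.4, 0.1],
         [0.5, 0.4, 0.1]]
top1 = top_k_accuracy(y_true, probs, k=1)
top2 = top_k_accuracy(y_true, probs, k=2)
```

Because Top-2 accuracy credits the model whenever the true grade is one of its two best guesses, it is always at least as high as Top-1 accuracy and should be read as a measure of near-miss behavior rather than exact agreement.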

Discussion

This study developed and evaluated CNN models for automated histologic scoring of osteoarthritis severity in mouse cartilage using the OARSI criteria. Among the tested architectures, VGG16 demonstrated the most consistent and accurate performance.

Our results revealed significant interrater variability, particularly for OARSI grades 0.5–4 (Fig. 2C), highlighting the limitations of subjective cartilage evaluation. This finding is consistent with previous studies reporting inconsistencies in low-grade classifications using radiological grading systems such as the Kellgren–Lawrence (KL) score.^30 These observations underscore the need for an objective and automated grading system to improve consistency and reliability. To address this issue, we employed a CNN-based deep learning model combined with cartilage rotation and cropping techniques. Among the models tested, VGG16 outperformed the others (Fig. 4B-D; Table 1), and the results suggest that preprocessing steps such as cartilage alignment and region-based feature extraction contributed to more accurate cartilage identification. These findings align with previous reports highlighting the importance of preprocessing for improving classification accuracy.^31,32

This study supports the applicability of deep learning-based OA scoring systems in preclinical research by demonstrating performance comparable with that of trained human raters. The VGG16 model showed the highest similarity to expert grading, with an AUC of 0.943 and an AP of 0.712 (Fig. 5). These results indicate that deep learning models can deliver robust performance for OA scoring despite minor misclassifications, reinforcing their potential value in preclinical research.^33 In addition, the VGG16 model was robust to minor preprocessing variations arising from cartilage region cropping (Fig. 6). This finding suggests that the model can tolerate such variability within an acceptable margin, thereby enhancing its applicability in real experimental settings. Such robustness further supports the potential of deep learning-based scoring systems to provide reliable and reproducible assessments across data sets obtained under varying imaging conditions.

Moreover, to validate the two-stage preprocessing pipeline more rigorously, a manual audit of the independent test set (n = 418) was performed, revealing a 94.0% success rate for the YOLO-v7-based cropping. Notably, the classification error rate in the failed-crop subgroup (60.0%, 15/25) was approximately 2.2-fold higher than in the successful-crop subgroup (27.2%, 107/393) (Suppl. Table S8). This disparity indicates that precise ROI localization is a critical determinant of diagnostic accuracy, suggesting that the performance gains in this study were driven in large part by the automated rotation-and-crop pipeline, which preserves grade-relevant morphological features, rather than by the CNN architecture alone.
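The audit figures above can be reproduced with a short calculation. The counts are those reported in the text; the variable names are illustrative, and the ratio is only a descriptive comparison of the two subgroups, not a significance test (the failed-crop subgroup contains just 25 images).

```python
# Counts reported in the text for the independent test set.
n_total = 418
n_failed, err_failed = 25, 15   # failed-crop subgroup: images, misclassified images
n_ok, err_ok = 393, 107         # successful-crop subgroup: images, misclassified images

# The two subgroups together make up the whole test set.
assert n_failed + n_ok == n_total

crop_success_rate = n_ok / n_total              # share of images cropped successfully
error_rate_failed = err_failed / n_failed       # error rate after a failed crop
error_rate_ok = err_ok / n_ok                   # error rate after a successful crop
risk_ratio = error_rate_failed / error_rate_ok  # how many times higher the error rate is

print(f"Crop success rate:         {crop_success_rate:.1%}")
print(f"Error rate, failed crops:  {error_rate_failed:.1%}")
print(f"Error rate, good crops:    {error_rate_ok:.1%}")
print(f"Ratio of error rates:      {risk_ratio:.1f}")
```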

When assessing disease severity, such as quantifying cartilage condition according to the OARSI criteria, the boundaries between grades are often ambiguous, which complicates accurate evaluation and introduces variability in results. This issue can also arise when deep learning models classify multiple classes.^33,34 To account for this inherent grading ambiguity, we explored an alternative evaluation method in which a prediction was counted as correct when the reference grade was among the top two predicted classes, rather than relying solely on the top prediction. While Top-2 classification yielded higher numerical values for accuracy and other performance metrics (Suppl. Fig. S6; Suppl. Table S9), this does not reflect a true enhancement in model capability. Instead, it accounts for the intrinsic ambiguity of the OARSI grading system, where adjacent grades are often difficult to distinguish even for experts. By permitting two plausible predictions, the Top-2 approach provides a more realistic framework for evaluating model performance under conditions that better mirror human grading variability. Even under this criterion, 11 out of 418 cases showed discrepancies between the predicted and actual scores. Grad-CAM analysis revealed that misclassifications resulted from inaccurate cartilage identification or exclusion of damaged areas, similar to errors made by human raters (Suppl. Fig. S5). This suggests that grading ambiguity affects both human experts and artificial intelligence (AI) models. Nevertheless, with an overall accuracy of 0.97, these misclassifications fall within an acceptable margin of error, supporting the reliability of our deep learning-based system.^35

A potential concern is that consensus labels may encode shared rater scoring habits, particularly in lower-grade OA where label uncertainty is common. We therefore stratified the independent test set (OARSI 0.5–4) by agreement level and evaluated performance separately. Accuracy was similar for full agreement (3/3; accuracy: 0.790) and partial agreement (2/3; accuracy: 0.800) (Suppl. Table S10), suggesting that residual errors likely reflect intrinsic grading ambiguity and label uncertainty rather than strong dependence on a shared scoring style.
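The stratified evaluation described above can be sketched as a per-group accuracy calculation. This is a minimal illustration only: the function name `accuracy_by_stratum`, the record layout, and the example grades are hypothetical and are not the study data.

```python
from collections import defaultdict

def accuracy_by_stratum(records):
    """Compute accuracy per stratum from (stratum, reference_grade, predicted_grade) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for stratum, y_true, y_pred in records:
        total[stratum] += 1
        correct[stratum] += int(y_true == y_pred)
    return {stratum: correct[stratum] / total[stratum] for stratum in total}

# Hypothetical test images labelled with rater agreement level ("3/3" or "2/3"),
# consensus grade, and model prediction.
records = [
    ("3/3", 1, 1), ("3/3", 2, 2), ("3/3", 0.5, 1), ("3/3", 3, 3),
    ("2/3", 1, 1), ("2/3", 2, 1), ("2/3", 0.5, 0.5), ("2/3", 3, 3), ("2/3", 4, 4),
]

acc = accuracy_by_stratum(records)
for stratum, value in sorted(acc.items()):
    print(f"agreement {stratum}: accuracy = {value:.3f}")
```

Comparable accuracy across agreement levels, as reported in the text, is what one would expect if the model were not simply reproducing a scoring habit shared by the raters.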

This study has several limitations. First, the training data set consisted exclusively of coronal sections. Future research should explore preprocessing methods for cartilage images from various orientations to improve generalizability. Second, classification accuracy may be affected by image resolution, as most misclassifications occurred in low-resolution images. Therefore, it is crucial to collect images with sufficiently high resolution (Suppl. Table S11). Third, the data set was derived from only two laboratories, which may limit its diversity. Incorporating samples with varying staining intensities and image characteristics is essential for broader applicability. Fourth, the image classification framework in this study was limited to CNN-based models. Although CNNs demonstrated strong performance, future studies should evaluate and compare alternative architectures, such as transformer-based or hybrid models, to further improve classification robustness and accuracy. Fifth, this study did not consider other factors such as subchondral bone plate thickness, osteophyte maturity, and synovitis, which may influence OA progression. Future work should integrate these parameters for a more comprehensive evaluation. Finally, future studies could explore weakly supervised learning using objective proxy labels, such as weeks post-surgery and sex. This approach may overcome the inherent subjectivity of OARSI grading by identifying morphological features of disease progression independently of human-defined labels.

In conclusion, this study automated a histological scoring system for cartilage evaluation in mouse models of OA using deep learning. This approach offers several advantages. First, it enables semi-quantitative assessment of arthritis severity without manually processing large volumes of cartilage images. Second, our deep learning algorithm can objectively evaluate the condition of the mouse knee. Third, while the model fundamentally outputs a single OARSI grade per image, the Top-2 prediction approach accommodates the ambiguity between adjacent grades and yielded higher agreement with the reference standard, offering a practical way to account for the subjectivity inherent in human assessments. An automated histological scoring system for OA is expected to advance OA research using animal models.

Supplemental Material

Supplemental material, sj-pdf-1-car-10.1177_19476035261435798 for Automated Objective Scoring of Osteoarthritis Severity in Mouse Medial Tibial Cartilage Using Deep Learning by Ka Hyon Park, Young-Gwon Kim, Gyuseok Lee, Su-Jin Kim, Yoonkyung Won, Yun Hyun Huh, Tae-Jong Kim, Jang-Soo Chun, Ho-Jun Song and Je-Hwang Ryu in CARTILAGE

Footnotes

Author Contributions

Ka Hyon Park: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft.

Young-Gwon Kim: Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft.

Gyuseok Lee: Data curation, Validation.

Su-Jin Kim: Data curation, Validation.

Yoonkyung Won: Data curation, Writing – review & editing.

Yun Hyun Huh: Data curation, Writing – review & editing.

Tae-Jong Kim: Data curation, Writing – review & editing.

Jang-Soo Chun: Data curation, Resources, Writing – review & editing.

Ho-Jun Song: Conceptualization, Investigation, Methodology, Software, Writing – review & editing.

Je-Hwang Ryu: Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – review & editing.

Ka Hyon Park and Young-Gwon Kim are joint first authors.

Je-Hwang Ryu and Ho-Jun Song contributed equally to this work.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Supported by National Research Foundation of Korea (NRF) grants funded by the Korean government (MSIT) (2019R1A5A2027521, 2021R1A2C3005727, and 2022R1I1A1A01072371) and a grant from the Chonnam National University Hospital Biomedical Research Institute (BCRI25057).

Declaration of Conflicting Interests

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Ka Hyon Park reports: National Research Foundation of Korea (NRF) grants funded by the Korean government (MSIT) (Grant No. 2022R1I1A1A01072371), related to this study. Yoonkyung Won reports: National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (Grant No. 2019R1A5A2027521), related to this study. Tae-Jong Kim reports: a grant from the Chonnam National University Hospital Biomedical Research Institute (Grant No. BCRI25057), related to this study. Je-Hwang Ryu reports: National Research Foundation of Korea (NRF) grants funded by the Korean government (MSIT) (Grant Nos. 2019R1A5A2027521, 2021R1A2C3005727) and a grant from the Chonnam National University Hospital Biomedical Research Institute (Grant No. BCRI25057), all related to this study.

Data Availability Statement

The data that support the findings of this study are available to other researchers from the corresponding author upon reasonable request.

Supplemental Material

Supplementary material for this article is available on the Cartilage website.

References

1. Sinusas. Osteoarthritis: diagnosis and treatment. Am Fam Physician. 2012;85(1):49-56.

2. Dieppe, Lohmander LS. Pathogenesis and management of pain in osteoarthritis. Lancet Lond Engl. 2005;365(9463):965-73. doi:10.1016/S0140-6736(05)71086-2.

3. Caplazi, Baca, Barck, Carano RAD, DeVoss, Lee, et al. Mouse models of rheumatoid arthritis. Vet Pathol. 2015;52:819-26. doi:10.1177/0300985815588612.

4. Drevet, Favier, Brun, Gavazzi, Lardy. Mouse models of osteoarthritis: a summary of models and outcomes assessment. Comp Med. 2022;72:3-13. doi:10.30802/AALAS-CM-21-000043.

5. Vincent, Williams, Maciewicz, Silman, Garside; Arthritis Research UK Animal Models Working Group. Mapping pathogenesis of arthritis through small animal models. Rheumatology. 2012;51(11):1931-41. doi:10.1093/rheumatology/kes035.

6. Pastoureau, Hunziker, Pelletier JP. Cartilage, bone and synovial histomorphometry in animal models of osteoarthritis. Osteoarthritis Cartilage. 2010;18 Suppl 3:S106-S112. doi:10.1016/j.joca.2010.05.024.

7. Pritzker KPH, Gay, Jimenez, Ostergaard, Pelletier J-P, Revell, et al. Osteoarthritis cartilage histopathology: grading and staging. Osteoarthritis Cartilage. 2006;14(1):13-29. doi:10.1016/j.joca.2005.07.014.

8. Pauli, Whiteside, Heras, Nesic, Koziol, Grogan, et al. Comparison of cartilage histopathology assessment systems on human knee joints at all stages of osteoarthritis development. Osteoarthritis Cartilage. 2012;20(6):476-85. doi:10.1016/j.joca.2011.12.018.

9. Son, Park, Kwak, Won, Choi, Rhee, et al. Estrogen-related receptor γ causes osteoarthritis by upregulating extracellular matrix-degrading enzymes. Nat Commun. 2017;8:2133. doi:10.1038/s41467-017-01868-8.

10. Yang, Youngnim, Kim, Chun JS. Prokineticin 2 is a catabolic regulator of osteoarthritic cartilage destruction in mouse. Arthritis Res Ther. 2023;25:236. doi:10.1186/s13075-023-03206-4.

11. Glasson, Chambers, Van Den Berg, Little CB. The OARSI histopathology initiative - recommendations for histological assessments of osteoarthritis in the mouse. Osteoarthritis Cartilage. 2010;18 Suppl 3:S17-S23. doi:10.1016/j.joca.2010.05.025.

12. Laverty, Girard, Williams, Hunziker, Pritzker KP. The OARSI histopathology initiative - recommendations for histological assessments of osteoarthritis in the rabbit. Osteoarthritis Cartilage. 2010;18 Suppl 3:S53-S65. doi:10.1016/j.joca.2010.05.029.

13. Chatfield, Simonyan, Vedaldi, Zisserman. Return of the devil in the details: delving deep into convolutional nets. arXiv. 2014. doi:10.48550/arXiv.1405.3531.

14. Krizhevsky, Sutskever, Hinton GE. ImageNet classification with deep convolutional neural networks. Commun ACM. 2017;60:84-90. doi:10.1145/3065386.

15. Tiribilli, Bocchi. Deep learning-based workflow for bone segmentation and 3D modeling in cone-beam CT orthopedic imaging. Appl Sci. 2024;14(17):7557. doi:10.3390/app14177557.

16. Chaturvedi. DeepRA: predicting joint damage from radiographs using CNN with attention. arXiv. 2022. doi:10.48550/arXiv.2102.06982.

17. Power, Acevedo, Yamashita, Rubin, Martin, Barbero. Deep learning enables the automation of grading histological tissue engineered cartilage images for quality control standardization. Osteoarthritis Cartilage. 2021;29(3):433-43. doi:10.1016/j.joca.2020.12.018.

18. Simonyan, Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv. 2015. doi:10.48550/arXiv.1409.1556.

19. He, Zhang, Ren, Sun. Deep residual learning for image recognition. arXiv. 2015. doi:10.48550/arXiv.1512.03385.

20. Chollet. Xception: deep learning with depthwise separable convolutions. arXiv. 2017. doi:10.48550/arXiv.1610.02357.

21. Lee, Yang, Kim, Tran, Lee, Park, et al. Enhancement of intracellular cholesterol efflux in chondrocytes leading to alleviation of osteoarthritis progression. Arthritis Rheumatol. 2025;77(2):151-62. doi:10.1002/art.42984.

22. Shin, Kwak, Kim, Chun JS. Fibroblast growth factor 7 (FGF7) causes cartilage destruction, subchondral bone remodeling, and the premature growth plate closure in mice. Osteoarthritis Cartilage. 2025;33(4):426-36. doi:10.1016/j.joca.2024.11.010.

23. Lee, Kim, Kwak, Park, Chun JS. The cereblon-AMPK (AMP-activated protein kinase) axis in chondrocytes regulates the pathogenesis of osteoarthritis. Osteoarthritis Cartilage. 2024;32(12):1579-90. doi:10.1016/j.joca.2024.08.009.

24. Shin, Cho, Kim, Chun JS. STING mediates experimental osteoarthritis and mechanical allodynia in mouse. Arthritis Res Ther. 2023;25:90. doi:10.1186/s13075-023-03075-x.

25. Choi, Lee, Song, Koh, Yang, Kwak, et al. The CH25H–CYP7B1–RORα axis of cholesterol metabolism regulates osteoarthritis. Nature. 2019;566(7743):254-8.

26. Kim, Jeon, Shin, Won, Lee, Kwak, et al. Regulation of the catabolic cascade in osteoarthritis by the zinc-ZIP8-MTF1 axis. Cell. 2014;156:730-43. doi:10.1016/j.cell.2014.01.007.

27. Wang, Bochkovskiy, Liao HYM. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv. 2022. doi:10.48550/arXiv.2207.02696.

28. 3.4. Metrics and scoring: quantifying the quality of predictions. 2024. https://scikit-learn.org/stable/modules/model_evaluation.html

29. Selvaraju, Cogswell, Das, Vedantam, Parikh, Batra. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int J Comput Vis. 2020;128(2):336-59. doi:10.1007/s11263-019-01228-7.

30. Hayes, Kittelson, Loyd, Wellsandt, Flug, Stevens-Lapsley. Assessing radiographic knee osteoarthritis: an online training tutorial for the Kellgren-Lawrence grading scale. Mededportal J Teach Learn Resour. 2016;12:10503. doi:10.15766/mep_2374-8265.10503.

31. Takahashi, Matsubara, Uehara. Data augmentation using random image cropping and patching for deep CNNs. IEEE Trans Circuits Syst Video Technol. 2020;30:2917-31. doi:10.1109/TCSVT.2019.2935128.

32. Remmelzwaal, Mishra, Ellis GFR. Human eye inspired log-polar pre-processing for neural networks. 2020 International SAUPEC/RobMech/PRASA Conference; 2020 Jan 29-31; Cape Town, South Africa. New York: IEEE; 2020.

33. Bargal, Zunino, Petsiuk, Zhang, Saenko, Murino, et al. Guided zoom: zooming into network evidence to refine fine-grained model decisions. IEEE Trans Pattern Anal Mach Intell. 2021;43(11):4196-202. doi:10.1109/TPAMI.2021.3054303.

34. Sawada, Kaneko, Sagi. Trade-offs in top-k classification accuracies on losses for deep learning. arXiv. 2020. doi:10.48550/arXiv.2007.15359.

35. Stidham, Liu, Bishu, Rice, Higgins PDR, Zhu, et al. Performance of a deep learning model vs human reviewers in grading endoscopic disease severity of patients with ulcerative colitis. JAMA Netw Open. 2019;2:e193963. doi:10.1001/jamanetworkopen.2019.3963.
