Abstract
Objective
This study developed and optimized a deep learning model to automate OARSI-based histologic scoring of mouse medial tibial cartilage.
Design
Safranin-O-stained cartilage images were obtained from mice with OA induced by medial meniscus instability. A total of 2,788 images from 1,000 knees of 520 mice were included for model development and evaluation. Each data set was evaluated using deep learning models with multiple selection criteria for the reference standard. For preprocessing, horizontal cartilage alignment was learned using a VGG16-based regression model, and the tibial cartilage region was detected using YOLO-v7. These learned weights were applied to generate a rotation- and crop-adjusted cartilage data set, which was used to evaluate three CNNs.
Results
Initial classification of 544 mouse cartilage images showed low accuracy, leading to an expansion of the data set to 2,788 images. An algorithm was applied to align the images horizontally and crop only the joint region, thereby reducing misclassification of noncartilaginous regions. This approach significantly improved the accuracy of cartilage degradation scoring. Among the deep learning models evaluated, VGG16 showed the best performance, achieving an MAE of 0.33. The model also recorded a precision of 0.680 (95% CI: 0.668–0.693), recall of 0.645 (95% CI: 0.627–0.664), F1-score of 0.653 (95% CI: 0.636–0.671), and accuracy of 0.648 (95% CI: 0.631–0.665).
Conclusion
In this study, the VGG16 model showed high concordance with expert assessments, suggesting the feasibility of automating OA grading from histological images in large-scale animal studies.
Introduction
Osteoarthritis (OA) is a chronic degenerative joint disease characterized by the progressive breakdown of articular cartilage, leading to joint dysfunction, pain, and reduced mobility. Despite its high prevalence, the precise etiology of OA remains unclear, and effective treatments are still under development.1,2 Because cartilage tissue is difficult to obtain from patients, much research relies on animal models, particularly their joint tissues or cartilage cells, to investigate potential therapies. Mouse models are widely used in arthritis research, including studies of both OA and rheumatoid arthritis (RA).3-5 Current imaging modalities lack sufficient resolution or soft tissue contrast to assess cartilage in mouse models directly. Consequently, cartilage damage is typically evaluated using histological sections. Although histological analysis provides detailed visualization of joint tissues and their components, it is time-consuming, can damage samples, is limited to two dimensions, and is subject to observer interpretation.6
The Osteoarthritis Research Society International (OARSI) and Mankin scoring systems are the primary methods for assessing OA severity in animal models. Both evaluate cartilage degradation but differ in methodology. The OARSI system was developed to address limitations of the Mankin score, which is less reliable for early-stage OA due to its complexity. The OARSI score is widely preferred for its simplicity, proven effectiveness, broad acceptance, and ability to yield valuable insights into OA pathogenesis and therapeutic interventions.7-10 The OARSI system uses Safranin-O staining to assess structural cartilage changes, such as loss of staining, fissures, and surface erosion. Initially introduced for human OA in 2006, it was adapted in 2010 for standardized evaluation of OA lesions and histological changes in mice and rabbits with surgically induced OA.11,12 Despite its advantages, the OARSI score still depends on manual assessment, which is inherently subjective and influenced by the scorer’s expertise. This semi-quantitative approach can lead to interrater variability, requiring extensive scorer training and sometimes blind testing within the same institution. Moreover, validation often requires multiple tissue samples and comprehensive analyses, resulting in considerable time and cost burdens.
Deep learning has emerged as a promising solution for automating complex image-based evaluations.13,14 Deep learning algorithms, such as convolutional neural networks (CNNs), have enabled automatic extraction and analysis of information from histological images. Recent advances have demonstrated that two-step deep learning approaches can significantly improve diagnostic accuracy by detecting and classifying specific image regions. Several research groups have developed deep learning methods to evaluate joint damage in patients and animal models. For example, deep learning models have been used to assess the severity of knee OA from radiographs and to improve joint damage assessment by enabling more accurate bone segmentation and 3D modeling.15 Attention-based CNN models have also been implemented to automatically predict joint damage in patients with RA using radiographs.16 In addition, deep learning techniques have been used to automate histological grading of engineered cartilage tissues.17 However, despite these advancements, deep learning methods for fully automating OARSI scoring are not yet available.
Therefore, this study aimed to automate the classification of OA severity in animal models using deep learning. We hypothesized that training CNN models with histological images of OA in animal models could overcome the limitations of manual scoring and enable objective, reproducible classification. To achieve this, we applied three representative CNN models, visual geometry group (VGG16),18 residual network (ResNet50),19 and extreme inception (Xception),20 each trained independently on each data set. This approach aimed to minimize errors inherent to manual evaluation and provide consistent and objective assessments.
Material and Methods
Study Design
This study was a retrospective analysis performed to automate the classification of OA severity in mouse cartilage using Safranin-O-stained histological images. Preprocessed images were input into a VGG16 CNN model, trained against consensus scores as the reference standard. The VGG16 model extracted image features and classified OARSI grades through fully connected layers. Model performance was evaluated using standard performance metrics, and the overall study design is illustrated in the accompanying figure.

Study design.
Image Acquisition
All cartilage images used in this study were obtained from previous research conducted at Chonnam National University and Gwangju Institute of Science and Technology (GIST).9,10,21-26 A total of 1,000 leg specimens were collected from 520 mice (bilateral knees from 480 mice and unilateral knees from 40 mice), and Safranin-O-stained cartilage images were acquired from each leg. The knee joint cartilage was divided into three regions (anterior, middle, and posterior), and ten 5-µm sections were prepared for each region. One representative slide was selected from each region, resulting in three cartilage images per leg. These images represent distinct anatomical areas, each of which can exhibit varying degrees of cartilage damage. Therefore, assigning the same grade to all images from the same animal was not feasible. A total of 3,000 images were initially collected and divided into training (70%), validation (15%), and test (15%) sets based on mouse ID. As a result, the training set included 364 mice (2,100 images), while the validation and test sets included 78 mice each (450 images per set). Subsequently, images that were unsuitable for analysis were excluded proportionally, following the initial data split ratio. In total, 212 images were excluded based on predefined quality criteria (e.g., motion/defocus blur, section tearing or folding, staining failure/uneven staining, missing or severely truncated cartilage, and other artifacts that precluded reliable grading). Throughout this process, images derived from the same mouse were never included in more than one data set (i.e., training, validation, or test), thereby maintaining strict separation by mouse identity. After this filtering step, a total of 2,788 cartilage images were included in the final analysis.
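The mouse-level split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_by_mouse`, the random seed, and the synthetic image records (three images per mouse) are assumptions made only for the example.

```python
import random

def split_by_mouse(mouse_ids, ratios=(0.70, 0.15, 0.15), seed=0):
    """Assign every unique mouse ID to exactly one of train/val/test."""
    unique_ids = sorted(set(mouse_ids))
    rng = random.Random(seed)          # fixed seed (assumption) for reproducibility
    rng.shuffle(unique_ids)
    n_train = round(len(unique_ids) * ratios[0])
    n_val = round(len(unique_ids) * ratios[1])
    return {
        "train": set(unique_ids[:n_train]),
        "val": set(unique_ids[n_train:n_train + n_val]),
        "test": set(unique_ids[n_train + n_val:]),
    }

# Hypothetical image records: (image_name, mouse_id)
images = [(f"m{m:03d}_img{k}", m) for m in range(520) for k in range(3)]
split = split_by_mouse([mouse for _, mouse in images])

# Images follow their mouse, so no mouse contributes to two subsets.
subsets = {name: [img for img, mouse in images if mouse in ids]
           for name, ids in split.items()}
```

Splitting on mouse ID rather than on individual images keeps sections from the same animal out of both training and evaluation data, which is the leakage the text guards against.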
Data Set Composition
The small data set consisted of 544 images randomly selected from the total of 2,788 images and was divided into 388 training images, 78 validation images, and 78 test images; it was used as a pilot study to evaluate feasibility and identify failure modes. The large data set included all 2,788 images and was split into 1,952 training images, 418 validation images, and 418 test images; it was used for all primary experiments and final reporting. The training set was used to learn image patterns, the validation set to prevent overfitting, and the test set to evaluate model performance. Three of the four experts who participated in the interrater variability assessment assigned OARSI scores to all 2,788 images, and the reference standard was determined by majority voting. In cases without a majority agreement, the final score was determined through consensus among the three raters.
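The majority-voting rule for the reference standard can be illustrated with a short sketch. The helper name `majority_grade` is hypothetical; returning `None` simply flags the images that, per the text, would go to a consensus discussion among the raters.

```python
from collections import Counter

def majority_grade(scores):
    """Return the grade chosen by more than half of the raters, else None."""
    grade, count = Counter(scores).most_common(1)[0]
    return grade if count > len(scores) / 2 else None

# Two of three raters agree: the majority grade becomes the reference.
reference = majority_grade([1, 1, 2])

# All three raters disagree: no majority, so the image needs consensus review.
needs_consensus = majority_grade([0.5, 1, 2]) is None
```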
Image Preprocessing
Image preprocessing comprised sequential rotation and cropping to standardize orientation and focus on tibial cartilage. This explicit geometric preprocessing was adopted to preserve native histological texture and avoid unnecessary pixel-level transformations. Rotation standardized image orientation, and subsequent region of interest (ROI) cropping reduced irrelevant background while retaining grading-relevant tibial cartilage morphology. First, to address variations in image orientation, 397 images covering a wide range of inclination angles were selected from the total of 2,788 images. For each selected image, the inclination angle of the cartilage was calculated by connecting its two endpoints and measuring the angle of that line relative to the horizontal axis (allowing both positive and negative values). Each image was then rotated to align the cartilage horizontally. These rotated images were used to train a VGG16-based regression model to predict the inclination angle of cartilage in unprocessed images. The learned weights were subsequently applied to the entire data set to predict the inclination angle relative to the horizontal axis, and each cartilage image was rotated accordingly to produce a cartilage-rotation data set. The mean absolute error (MAE) was used as the loss function for the regression model predicting the cartilage angle. Next, an object detection data set was created by labeling bounding boxes on 337 rotated images to crop only the tibial cartilage region. Of these, 273 images were used for training and 64 for validation. Object detection was performed using the you only look once (YOLO-v7) deep learning model. YOLO can detect multiple objects within a single image by dividing the input into a grid and directly predicting bounding boxes and class probabilities for each grid cell, achieving high detection accuracy with rapid inference.27
This two-stage preprocessing pipeline (rotation followed by cropping) produced a final cartilage-rotation-crop data set containing only the tibial cartilage regions, and the entire process was fully automated.
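The angle label and the MAE loss used for the rotation step can be written out in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, endpoints are given as (x, y) pairs, and the y-axis is taken to point upward (with image coordinates, where y grows downward, the sign of the angle flips).

```python
import math

def inclination_deg(left_end, right_end):
    """Signed angle, in degrees, between the endpoint-to-endpoint line and the horizontal axis."""
    dx = right_end[0] - left_end[0]
    dy = right_end[1] - left_end[1]
    return math.degrees(math.atan2(dy, dx))

def mean_absolute_error(true_angles, predicted_angles):
    """MAE between annotated and predicted inclination angles."""
    errors = [abs(t - p) for t, p in zip(true_angles, predicted_angles)]
    return sum(errors) / len(errors)

# A cartilage line rising to the right has a positive inclination.
angle = inclination_deg((0, 0), (10, 10))

# Rotating the image by the negative of the predicted angle levels the cartilage.
correction = -angle
```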
Deep Learning Model
Using the prepared data sets, we trained well-established CNN classification models, namely VGG16, ResNet50, and Xception, which were chosen for their proven performance in image classification and regression tasks and their ability to handle complex visual patterns. The VGG16 model features a simple and uniform architecture consisting of 16 weight layers and small convolutional filters (3 × 3) that enable the capture of fine-grained features. The ResNet50 model addresses the vanishing gradient problem using a 50-layer CNN based on residual learning. The Xception model, based on the Inception architecture, leverages extreme separable convolutions for greater efficiency. For the small data set and the tibia-crop data set, training was performed using the VGG16 model only. For the large data set and the cartilage-rotation-crop data set, we trained the VGG16, ResNet50, and Xception models and compared their performance. Data augmentation techniques, including zooming in/out, brightness adjustment, and horizontal flipping, were applied depending on the data set. Classification was performed by selecting the class with the highest predicted probability from the final softmax layer of each model. Stochastic gradient descent (SGD) was used as the optimizer (learning rate = 0.001, momentum = 0.9), and categorical cross-entropy was applied as the loss function. To prevent overfitting, we added a dropout layer (rate = 0.5) after the fully connected layer of each model. Because the models differed in learning speed and convergence behavior, training was stopped when the validation performance or loss no longer improved, resulting in 30 to 100 training epochs.
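The stopping rule described above is commonly implemented with a patience counter. The sketch below shows that logic on a list of validation losses; the `patience` and `min_delta` values are illustrative assumptions, since the text does not report the exact criterion that was used.

```python
def stopping_epoch(val_losses, patience=10, min_delta=0.0):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Validation loss improved: remember it and reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Loss kept improving (or patience never ran out): train to the end.
    return len(val_losses)

# Hypothetical validation-loss history that plateaus after epoch 3.
history = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73]
stop_at = stopping_epoch(history, patience=3)
```

Under such a rule, a model whose validation loss plateaus early stops near the lower end of the reported 30 to 100 epoch range, while a slower learner runs longer.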
Analysis of Learning/Loss Curves and Confusion Matrices
Learning and loss curves were used to evaluate model training and generalization performance, with the X-axis representing epochs and the Y-axis showing loss (left) and accuracy (right). These curves allowed us to assess convergence and to detect potential overfitting or underfitting. To analyze classification performance, a confusion matrix was generated to visualize the Top-N (Top-1 and Top-2) predictions on the test sets for both the small and large data sets. Diagonal elements indicate correct classifications, whereas off-diagonal elements represent misclassifications.
Performance Evaluation
The performance of each model was evaluated by precision (positive predictive value), recall (sensitivity), F1-score, and accuracy, as follows:
Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
F1-score = 2 × (Precision × Recall)/(Precision + Recall)
Accuracy = (TP + TN)/(TP + TN + FP + FN)
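For a multi-class problem, the formulas above are usually applied per grade in a one-vs-rest fashion and then averaged. The following sketch does this with macro averaging, which is an assumption here because the averaging scheme is not stated in the text; the function name `macro_scores` and the toy labels are likewise illustrative.

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, and F1, plus overall accuracy."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
    }

# Toy example with three grades.
scores = macro_scores([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```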
Additionally, to complement performance assessment in the multi-class classification setting, Top-1 accuracy (where the top prediction matches the true label) and Top-2 accuracy (where one of the top two predictions matches the true label) were calculated. A multi-class evaluation approach was used because the classification involved six or eight classes, and all calculations were performed using Python’s sklearn.metrics module.28 For the final preprocessed data set, 95% confidence intervals (CIs) were estimated to provide a measure of statistical uncertainty.
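Top-1 and Top-2 accuracy differ only in how many of the highest-probability classes are allowed to match the reference grade. A minimal sketch, assuming each prediction is a list of per-grade probabilities; the function name and the three-grade toy data are hypothetical.

```python
def top_k_accuracy(y_true, probabilities, grades, k=1):
    """Share of samples whose true grade is among the k most probable grades."""
    hits = 0
    for true_grade, probs in zip(y_true, probabilities):
        ranked = sorted(range(len(grades)), key=lambda i: probs[i], reverse=True)
        top_grades = {grades[i] for i in ranked[:k]}
        hits += true_grade in top_grades
    return hits / len(y_true)

grades = (0, 0.5, 1)                      # toy subset of grades
y_true = [0, 1, 1]
probabilities = [
    [0.6, 0.3, 0.1],                      # correct at Top-1
    [0.2, 0.5, 0.3],                      # wrong at Top-1, correct at Top-2
    [0.7, 0.2, 0.1],                      # wrong at both
]
top1 = top_k_accuracy(y_true, probabilities, grades, k=1)
top2 = top_k_accuracy(y_true, probabilities, grades, k=2)
```

By construction Top-2 accuracy can never be lower than Top-1 accuracy, so it is best read as a measure of near-miss errors rather than as a direct replacement for Top-1.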
Receiver Operating Characteristic (ROC) Curve and Precision-Recall (PR) Curve Analysis
To evaluate the performance of the deep learning models, we utilized ROC curves and area under the curve (AUC), as well as PR curves and average precision (AP) metrics. The ROC curve was used to assess the trade-off between sensitivity and specificity, while the PR curve provided insight into precision and recall, particularly for imbalanced data. To evaluate the classification performance across the eight grades, we employed a one-vs-rest (OvR) strategy for both ROC and PR curve analyses. Overall model performance was summarized using macro-averaged AUC and AP to give equal weight to each grade. In the ROC and PR plots, the macro-averaged no-skill baselines are indicated by red dashed lines.
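The one-vs-rest strategy and macro averaging can be sketched without any plotting. The code below computes AUC through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative) and derives the usual no-skill reference for PR curves, the class prevalence. Function names and toy data are assumptions for illustration only.

```python
def binary_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    positives = [s for l, s in zip(labels, scores) if l]
    negatives = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def macro_ovr_auc(y_true, probabilities, grades):
    """Treat each grade as 'positive' in turn, then average the AUCs equally."""
    aucs = []
    for index, grade in enumerate(grades):
        labels = [t == grade for t in y_true]
        scores = [probs[index] for probs in probabilities]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / len(aucs)

def macro_no_skill_ap(y_true, grades):
    """Mean class prevalence, the usual no-skill reference for PR curves."""
    return sum(y_true.count(g) / len(y_true) for g in grades) / len(grades)

grades = (0, 1, 2)
y_true = [0, 1, 2, 0]
probabilities = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1],
                 [0.1, 0.2, 0.7], [0.4, 0.5, 0.1]]
macro_auc = macro_ovr_auc(y_true, probabilities, grades)
baseline_ap = macro_no_skill_ap(y_true, grades)
```

When every sample belongs to one of the listed grades, the macro-averaged prevalence reduces to one divided by the number of grades, while the no-skill AUC is 0.5 regardless of class balance.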
Visualization of Model Decisions With Gradient-Weighted Class Activation Mapping (Grad-CAM)
To visually explain which regions the model focused on when making predictions, we applied Grad-CAM.29 The method generates coarse localization maps by visualizing the gradients flowing into the final convolutional layer, thereby highlighting image regions that contribute most to the predicted class.
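The core Grad-CAM computation is small: each feature map of the final convolutional layer is weighted by the spatial average of the class-score gradient for that map, the weighted maps are summed, and negative values are clipped. The sketch below uses plain nested lists in place of tensors; in practice the feature maps and gradients would come from the trained network, and the toy values here are made up.

```python
def grad_cam(feature_maps, gradients):
    """Combine K feature maps (each H x W) into one heat map.

    feature_maps[k][i][j]: activation of channel k at position (i, j)
    gradients[k][i][j]:    gradient of the class score w.r.t. that activation
    """
    heatmap = None
    for fmap, grad in zip(feature_maps, gradients):
        cells = [g for row in grad for g in row]
        weight = sum(cells) / len(cells)          # global-average-pooled gradient
        weighted = [[weight * a for a in row] for row in fmap]
        if heatmap is None:
            heatmap = weighted
        else:
            heatmap = [[h + w for h, w in zip(h_row, w_row)]
                       for h_row, w_row in zip(heatmap, weighted)]
    # ReLU keeps only regions that push the class score up.
    return [[max(0.0, value) for value in row] for row in heatmap]

# Two toy 2 x 2 channels: the first supports the class, the second opposes it.
feature_maps = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]
gradients = [[[1, 1], [1, 1]], [[-4, -4], [-4, -4]]]
cam = grad_cam(feature_maps, gradients)
```

The resulting map has the spatial resolution of the final convolutional layer, which is why Grad-CAM localization is coarse and is typically upsampled before being overlaid on the histological image.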
Robustness Assessment of the Model on Cropped Images
From the original data set of 2,788 images, 10 images were randomly selected from each grade. To assess robustness, each image was modified to include variations in the angle and size of cartilage regions and processed using the trained VGG16 model. For each image, nine versions were generated: the original image and eight differently cropped variants (crop 1-8). This resulted in 90 images per grade. The consistency of the VGG16 model’s predictions across these transformed images was then quantified.
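The bookkeeping here is 10 images × 9 versions = 90 images per grade. One simple way to quantify consistency, shown below, is the share of cropped variants that receive the same predicted grade as the original image. This particular measure and the function name are assumptions for illustration; they are not necessarily the statistic used in the study.

```python
def prediction_consistency(predictions_by_image):
    """Share of crop variants predicted the same as the original image.

    predictions_by_image maps an image ID to a list of predicted grades,
    with the original image first and its cropped variants after it.
    """
    agree = total = 0
    for versions in predictions_by_image.values():
        original, variants = versions[0], versions[1:]
        agree += sum(v == original for v in variants)
        total += len(variants)
    return agree / total

images_per_grade = 10
versions_per_image = 1 + 8                 # original plus crop 1-8
per_grade_total = images_per_grade * versions_per_image

# Hypothetical predictions for two images with three variants each.
consistency = prediction_consistency({"a": [1, 1, 1, 2], "b": [2, 2, 2, 2]})
```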
Statistical Analysis
To quantify interrater variability in OARSI scoring, Fleiss’ Kappa values were calculated to measure agreement among multiple expert raters. To assess the reliability of both model performance metrics and rater assessments, 95% CIs were computed, providing a statistical range within which the true values are likely to fall. All statistical analyses were performed using Python’s statsmodels and sklearn.metrics modules.
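Fleiss' Kappa compares the observed agreement among raters with the agreement expected by chance. The study reports using statsmodels for this; the standalone sketch below only spells out the formula, and its input layout (one row per image, one count per grade) and toy data are assumptions for the example.

```python
def fleiss_kappa(ratings):
    """Fleiss' Kappa for a table of rating counts.

    ratings[i][j] is the number of raters who assigned image i to category j;
    every row must sum to the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    # Share of all assignments that fell into each category.
    p_category = [sum(row[j] for row in ratings) / (n_items * n_raters)
                  for j in range(n_categories)]
    # Agreement among rater pairs within each image.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in ratings]
    observed = sum(p_item) / n_items
    expected = sum(p * p for p in p_category)
    return (observed - expected) / (1 - expected)

# Four raters, two categories, complete agreement on every image.
perfect = fleiss_kappa([[4, 0], [0, 4], [4, 0], [0, 4]])

# Four raters split evenly on every image: worse than chance.
split_votes = fleiss_kappa([[2, 2], [2, 2]])
```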
Model Availability
The final model developed in this study, which was evaluated on the test data, is available online at https://github.com/esfman-git/Osteoarthritis_grading
Results
Interrater Variability in OARSI Cartilage Grading
To evaluate the interrater reliability of the OARSI grading system in assessing cartilage degradation, four independent investigators graded the same Safranin-O-stained tissue samples. Despite the use of a standardized scoring system, differences in observation points and subjective interpretation resulted in inconsistencies among raters

Limitations of OA severity assessment using the OARSI grading system.
Application of Deep Learning Using Small Data Set
To mitigate the subjectivity associated with the OARSI scoring system, a VGG16-based classification model was trained on a small data set. Following training, the model’s learned weights were applied to a separate test set for prediction. The training curves indicated that the validation loss did not converge to 0, and the validation accuracy did not reach 1

Comparison of prediction accuracy among three CNN models and expert raters on data sets of different sizes.
Deep Learning Model Performance on a Large Data Set
To reduce misclassification errors potentially caused by insufficient training data, we next increased the number of images used for training by expanding the data set. Training curves for loss and accuracy revealed that VGG16 exhibited a stable learning pattern, while ResNet50 and Xception showed irregular validation loss and accuracy
Deep Learning Model Performance on a Tibia-Crop Data Set
A tibia-crop data set was generated by labeling and cropping only the tibial cartilage region using the YOLO-v7 model. The VGG16 model was trained on this data set; however, no significant improvement in classification performance was observed compared to the model trained on the large data set containing the full cartilage region. Specifically, the model achieved a precision of 0.54, recall of 0.57, F1-score of 0.53, and overall accuracy of 0.54, all of which were lower than those obtained with the large data set.
Performance Evaluation of VGG16 Model Using Cartilage-Rotation-Crop Data Set
Cartilage images were first rotated to a horizontal orientation using a VGG16-based regression model, then cropped to include only the cartilage region using YOLO-v7-based object detection

A comparison of learning effects of three CNN models on cartilage-rotation-crop data set obtained through preprocessing of original images.
Classification Report for the Cartilage-Rotation-Crop Data Set.
TP = True Positive; FP = False Positive; FN = False Negative; TN = True Negative.
Precision = TP/(TP + FP).
Recall = TP/(TP + FN).
F1-score = 2 × (Precision × Recall)/(Precision + Recall).
Accuracy = (TP + TN)/(TP + TN + FP + FN).
MAE = Mean Absolute Error.
The 95% confidence intervals are given in parentheses.

ROC and PR curves for OA severity classification using three CNN models (VGG16, ResNet50, Xception).

Robustness of the VGG16 model to variations in image cropping. Images graded as G1 were used to generate multiple cropped variants (Crop 1–8). The VGG16 model showed consistent classification accuracy between the cartilage-rotation-crop data set (Accuracy: 0.648, 95% CI: 0.631–0.665) and the cropped variants (Accuracy: 0.636, 95% CI: 0.601–0.671). The Kappa value of 0.760 (95% CI: 0.698–0.821) indicates substantial agreement, supporting the model’s robustness to minor variations in cartilage region cropping.
Addressing Grading Ambiguity With a Top-2 Prediction Approach
To address ambiguity near the boundaries between grading categories, a Top-2 prediction approach was employed. Using this scheme, the true values in the confusion matrices for all three CNN models were more concentrated along the diagonal compared to those obtained using the standard Top-1 prediction method
Discussion
This study developed and evaluated CNN models for automated histologic scoring of osteoarthritis severity in mouse cartilage using the OARSI criteria. Among the tested architectures, VGG16 demonstrated the most consistent and accurate performance.
Our results revealed significant interrater variability, particularly for OARSI grades 0.5-4
This study supports preclinical research applicability of deep learning-based OA scoring systems by demonstrating performance comparable with that of trained human raters. The VGG16 model showed the highest similarity to expert grading, with an AUC of 0.943 and an AP of 0.712
Moreover, to provide a more rigorous and objective validation of this two-stage preprocessing pipeline, a manual audit of the independent test set (n = 418) was performed, revealing a 94.0% success rate for the YOLO-v7-based cropping. Notably, the classification error rate in the failed-crop subgroup (60.0%, 15/25) was 2-fold higher than in the successful-crop subgroup (27.2%, 107/393)
When assessing disease severity, such as quantifying cartilage condition according to the OARSI criteria, the boundaries between grades can often be ambiguous, resulting in challenges for accurate evaluation and variability in results. This issue can also arise when deep learning models classify multiple classes.33,34 To account for this inherent grading ambiguity, we explored an alternative evaluation method by considering the top two predicted classes instead of relying solely on the top prediction as the reference standard. While Top-2 classification yielded higher numerical values for accuracy and other performance metrics
This study has several limitations. First, the training data set consisted exclusively of coronal sections. Future research should explore preprocessing methods for cartilage images from various orientations to improve generalizability. Second, classification accuracy may be affected by image resolution, as most misclassifications occur in low-resolution images. Therefore, it is crucial to collect images with sufficiently high resolution
In conclusion, this study successfully automated a histological scoring system for cartilage evaluation in mouse models of OA using deep learning. This approach offers several advantages. First, it enables semi-quantitative assessment of arthritis severity without manually processing large volumes of cartilage images. Second, our deep learning algorithm can objectively evaluate the condition of the mouse knee. Third, while the model fundamentally outputs a single OARSI grade per image, applying the Top-2 prediction approach led to a substantial improvement in grading accuracy and consistency, demonstrating its effectiveness in addressing grading ambiguity and reducing subjectivity in human assessments. An automated histological scoring system for OA is expected to advance OA research using animal models.
Supplemental Material
sj-pdf-1-car-10.1177_19476035261435798 – Supplemental material for Automated Objective Scoring of Osteoarthritis Severity in Mouse Medial Tibial Cartilage Using Deep Learning
Supplemental material, sj-pdf-1-car-10.1177_19476035261435798 for Automated Objective Scoring of Osteoarthritis Severity in Mouse Medial Tibial Cartilage Using Deep Learning by Ka Hyon Park, Young-Gwon Kim, Gyuseok Lee, Su-Jin Kim, Yoonkyung Won, Yun Hyun Huh, Tae-Jong Kim, Jang-Soo Chun, Ho-Jun Song and Je-Hwang Ryu in CARTILAGE
Footnotes
Author Contributions
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Supported by the National Research Foundation of Korea (NRF) grants by the Korean government (MSIT) (2019R1A5A2027521, 2021R1A2C3005727, and 2022R1I1A1A01072371) and a grant of Chonnam National University Hospital Biomedical Research Institute (BCRI25057).
Declaration of Conflicting Interests
The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Ka Hyon Park reports: National Research Foundation of Korea (NRF) grants funded by the Korean government (MSIT) (Grant No. 2022R1I1A1A01072371), related to this study. Yoonkyung Won reports: National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (Grant No. 2019R1A5A2027521), related to this study. Tae-Jong Kim reports: a grant from the Chonnam National University Hospital Biomedical Research Institute (Grant No. BCRI25057), related to this study. Je-Hwang Ryu reports: National Research Foundation of Korea (NRF) grants funded by the Korean government (MSIT) (Grant Nos. 2019R1A5A2027521, 2021R1A2C3005727) and a grant from the Chonnam National University Hospital Biomedical Research Institute (Grant No. BCRI25057), all related to this study.
Data Availability Statement
The data that support the findings for this study are available to other researchers from the corresponding author upon reasonable request.