Abstract
Aim
To evaluate the diagnostic performance of an artificial intelligence (AI) system in detecting common dental diseases on panoramic radiographs and to compare its findings with expert manual interpretation.
Background
AI has gained increasing attention in dental radiology as a potential tool for supporting image interpretation. Although several studies have reported promising results, the real-world diagnostic performance of AI systems across diverse dental pathologies remains insufficiently validated.
Methods
This retrospective diagnostic accuracy study analyzed 100 panoramic radiographs using the Second Dentists® AI software (Velmeni Inc., USA). AI-generated binary outputs were compared with consensus diagnoses established by two experienced oral and maxillofacial radiologists blinded to AI results. Analysis was performed per radiograph, allowing multiple coexisting conditions. Diagnostic performance was assessed using sensitivity, specificity, precision, accuracy, F1-score, and agreement measures with 95% confidence intervals.
Results
The AI system demonstrated high diagnostic performance for missing teeth (MT), fixed prosthesis (FiP), root stumps (RS), and dental caries (DC). Lower sensitivity was observed for periapical pathology (PP), which accounted for the majority of false-negative (FN) findings. Agreement between the AI and expert interpretation was high for most categories, although it was reduced for subtle lesions.
Conclusion
AI shows potential as an adjunctive tool for interpreting panoramic radiographs, particularly for well-defined dental conditions. However, reduced sensitivity for subtle pathologies, methodological constraints, and limited external validation highlight the need for cautious clinical implementation and further multicenter evaluation.
Keywords
Introduction
Oral health is a vital component of general well-being, yet dental diseases continue to affect millions of individuals worldwide. When left untreated, these conditions may lead to pain, infection, functional impairment, and significant deterioration in quality of life, productivity, and social functioning. 1 Early and accurate diagnosis is therefore essential for effective treatment planning, disease prevention, and long-term oral health maintenance.
In routine dental practice, diagnosis relies on both clinical examination and radiographic assessment. Panoramic radiography is widely used because it provides a comprehensive view of the maxillofacial region in a single exposure and helps detect a range of dental and osseous abnormalities, including caries, periapical lesions, impacted teeth (I), restorations, and bone pathologies. 2 However, its interpretation remains operator-dependent and may be influenced by clinical experience, visual perception, and cognitive bias, leading to inter-observer variability and diagnostic inconsistency.3, 4 In addition, inherent limitations such as geometric distortion, image superimposition, ghost images, and lower spatial resolution can reduce its accuracy for detecting subtle lesions, especially in anatomically complex regions.2, 3
Recent advances in artificial intelligence (AI), particularly machine learning and deep learning, have led to increasing interest in automated dental image analysis. In dentistry, AI has been explored for detecting caries, PP, periodontal bone loss, and restorations, with several studies reporting diagnostic performance comparable to trained clinicians under controlled conditions.2, 5, 6
Deep learning, particularly through convolutional neural networks (CNNs), has shown considerable success in image-based pattern recognition and classification. CNN-based models have been applied to various dental imaging modalities, including panoramic and bitewing radiographs, with encouraging results in detecting apical lesions, proximal caries, and periodontal defects. 5 Some investigations have reported that AI-assisted interpretation may outperform less-experienced practitioners in specific diagnostic tasks, such as the identification of periapical changes. 2 These findings highlight the potential role of AI as a supportive tool in clinical decision-making.
Despite promising results, existing literature reveals several important limitations. Many validation studies utilize curated datasets consisting of high-quality radiographs acquired under optimal conditions, which may not reflect routine clinical practice, where positioning errors, motion artifacts, and variable exposure are common. Moreover, methodological heterogeneity, limited external validation, small sample sizes, and inconsistent reference standards restrict the generalizability of reported outcomes. The frequent use of per-image rather than per-tooth analysis and the presence of class imbalance further complicate the interpretation of diagnostic accuracy metrics. Consequently, reported performance values may overestimate real-world clinical effectiveness.
In resource-limited settings, where access to trained oral and maxillofacial radiologists may be restricted, AI-assisted radiographic interpretation could potentially enhance diagnostic efficiency and reduce disparities in oral healthcare delivery. However, before such systems can be integrated into routine clinical workflows, a rigorous evaluation of their diagnostic performance, limitations, and clinical risks is essential.
Therefore, the present study aims to evaluate the diagnostic performance of an AI system in detecting common dental diseases on panoramic radiographs and to compare its findings with expert manual interpretation. By systematically assessing sensitivity, specificity, precision, and agreement measures, this study seeks to provide an objective appraisal of the potential and limitations of AI as an adjunctive tool in oral radiology, while emphasizing the need for cautious clinical implementation and further multicenter validation.
Materials and Methods
Study Design and Setting
This retrospective observational diagnostic accuracy study was conducted in the Department of Oral Radiology at Rishiraj College of Dental Sciences and Research Centre, Bhopal, India. The study was designed to evaluate the diagnostic performance of an AI system in detecting common dental conditions on panoramic radiographs by comparing its outputs with expert manual interpretation. The study was conducted in accordance with the STARD (Standards for Reporting Diagnostic Accuracy Studies) guidelines. The AI system was used solely as a diagnostic support tool, and all final interpretations were based on expert consensus to ensure clinical responsibility and patient safety.




Ethical Approval
Ethical clearance for the study was obtained from the Institutional Ethics Committee prior to data collection. As the investigation was retrospective in nature and utilized anonymized radiographic records, the requirement for informed consent was waived in accordance with the institutional guidelines.
Sample Size Determination
The sample size was calculated using the Cochran formula for diagnostic accuracy studies. The calculation was based on an assumed expected sensitivity of 85%, an estimated disease prevalence of 40%, a precision of 10%, and a confidence level of 95%, which yielded a minimum required sample size of 92 panoramic radiographs. To compensate for potential exclusions and incomplete data, a total of 100 radiographs were ultimately included in the study. Although the calculated minimum sample size was achieved, the dataset remains relatively limited for evaluating AI performance. This study was designed as a preliminary validation, and future studies with larger, multicenter datasets are required to improve robustness and generalizability.
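As a rough arithmetic check, the sketch below evaluates Cochran's formula, n = z²p(1 − p)/d², at the stated 95% confidence level (z ≈ 1.96) and 10% precision. The text does not say which proportion was entered as p. The sketch assumes it was the 40% prevalence, because that choice gives a value close to 92, whereas entering the 85% expected sensitivity gives a different figure. The function name `cochran_sample_size` is illustrative only.

```python
def cochran_sample_size(p: float, d: float, z: float = 1.96) -> float:
    """Cochran's formula for estimating one proportion: n = z^2 * p * (1 - p) / d^2."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if d <= 0.0:
        raise ValueError("d must be positive")
    return (z ** 2) * p * (1.0 - p) / (d ** 2)


# Assumption: p is the estimated prevalence (40%) and d is the 10% precision.
n_from_prevalence = cochran_sample_size(p=0.40, d=0.10)

# For comparison only: p taken as the expected sensitivity (85%).
n_from_sensitivity = cochran_sample_size(p=0.85, d=0.10)

print(f"p = 0.40: n = {n_from_prevalence:.1f}")
print(f"p = 0.85: n = {n_from_sensitivity:.1f}")
```

Under the prevalence assumption the result is about 92.2, in line with the reported minimum of 92 radiographs. Rounding up strictly would give 93.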
Sampling Method
Purposive sampling was employed to select panoramic radiographs that fulfilled the predefined eligibility criteria. Only images with adequate contrast, clarity, and diagnostic quality were included to ensure reliable interpretation by both the AI system and the expert evaluators. Although this approach enhanced internal validity, it may have introduced selection bias and limited the generalizability of findings to routine clinical settings.
Eligibility Criteria
High-quality panoramic radiographs obtained from individuals aged 15 years and above with predominantly permanent dentition, including partially edentulous cases, were included in the study. Radiographs that were blurred, incomplete, or exhibited positioning errors were excluded. Images containing artifacts such as earrings, spectacles, or metallic shadows were also excluded. Additional exclusion criteria included retained deciduous teeth or mixed dentition, severe dental crowding exceeding 8 mm per arch, completely edentulous jaws, and radiographs demonstrating fractures, cysts, tumors, or extensive pathological lesions (as shown in Figures 1 and 3).
AI System Description and Evaluation
All selected panoramic radiographs were analyzed using the Second Dentists® AI software (Velmeni Inc., USA). The software was operated in locked mode and provided binary outputs indicating the presence or absence of predefined diagnostic categories. No recalibration, retraining, or modification of the algorithm was performed using the study dataset. The system automatically detected DC, PP, RS, MT, I, RC_RE, and FiP. Default manufacturer-defined thresholds were applied for classification. Detailed information regarding the internal architecture, training dataset composition, and validation methodology of the Second Dentists® AI system was not publicly available, which limits reproducibility and interpretability. The system is presumed to be based on deep learning models, likely CNNs, trained on large annotated radiographic datasets. The AI system generated color-coded overlays highlighting detected abnormalities, including DC, periapical lesions, and restorations (as shown in Figures 2 and 4).
AI Color Interpretation
The Second Dentists® AI system uses color-coded overlays to distinguish various findings:
Light blue: DC
Pink: Restoration/filling
Light yellow: Root canal-treated teeth
Yellow: FiP
Red: PP
Violet: RS
Light pink: I
These color-coded annotations assist clinicians by visually highlighting areas of interest. However, interpretation should always be supplemented with clinical judgment to avoid misclassification or over-reliance on automated outputs.
Reference Standard and Reader Methodology
Two experienced oral and maxillofacial radiologists, each with more than 10 years of clinical experience, independently interpreted all panoramic radiographs. Both readers were blinded to the AI outputs and to each other’s assessments. Prior to formal evaluation, the radiologists underwent calibration using a pilot set of 20 radiographs that were not included in the final analysis. All interpretations were conducted in a single session under standardized viewing conditions. Inter-rater agreement between the two experts was assessed using Cohen’s kappa before consensus. In cases of disagreement, joint review and discussion were undertaken until consensus was achieved, and the agreed diagnosis was considered the reference standard. No supplementary clinical, cone beam computed tomography (CBCT), or histopathological confirmation was available, and this limitation was acknowledged in the analysis. The relatively small sample size of 100 panoramic radiographs represents an important limitation. Although the sample size was statistically justified, larger datasets are generally required for the robust validation of AI models. Future multicenter studies incorporating heterogeneous populations, imaging systems, and clinical settings are essential to enhance external validity.
Diagnostic Criteria
Standardized diagnostic criteria were established prior to image evaluation. DC was defined as radiolucency involving enamel and/or dentin. PP was identified as a well-defined periapical radiolucency corresponding to a Periapical Index score of three or higher. RS were defined as residual root fragments measuring less than one-third of the original crown height. MT were recorded when the complete absence of tooth structure was observed in the dental arch. I were identified as unerupted teeth embedded in bone beyond the normal eruption period. Restorations were recognized by the presence of radiopaque filling materials. RC_RE were identified by visible obturation material within the canals. FiP was identified as a prosthetic structure characterized by radiopaque crowns and abutment teeth.
Unit of Analysis and Handling of Multiple Findings
The panoramic radiograph served as the unit of analysis in this study. Each radiograph was allowed to contribute to more than one diagnostic category, as multiple dental conditions could coexist within a single image. For each category, the presence of at least one corresponding lesion was recorded as positive. Diagnostic performance metrics were calculated independently for each condition without pooling multiple labels. No composite disease score was generated. This per-radiograph approach was adopted for feasibility and to simulate clinical decision-making, where diagnosis is often made at the patient level. However, it may overestimate diagnostic performance compared with per-tooth analysis, because multiple lesions within a single radiograph are not evaluated independently.
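The per-radiograph rule described here (a category is positive if at least one corresponding lesion is present) can be sketched as follows. This is an illustration only: the `(tooth_id, category)` input format, the tooth numbers, and the function name `radiograph_labels` are assumptions, not a description of how the software or the readers recorded findings.

```python
from typing import Dict, Iterable, Tuple

# Category codes as used in the text.
CATEGORIES = ("DC", "PP", "RS", "MT", "I", "RC_RE", "FiP")


def radiograph_labels(findings: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Collapse (tooth_id, category) findings into one 0/1 label per category."""
    present = {category for _tooth, category in findings}
    unknown = present - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    # A category is positive when at least one finding of that type exists.
    return {category: int(category in present) for category in CATEGORIES}


# Hypothetical radiograph: two carious teeth and one periapical lesion.
example_findings = [("16", "DC"), ("26", "DC"), ("36", "PP")]
labels = radiograph_labels(example_findings)
print(labels)
```

In this example the two carious teeth collapse into a single positive DC label. A missed lesion on one of them would therefore not count as an error at the radiograph level, which is how this unit of analysis can overstate performance.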
Outcome Measures
For each diagnostic category, true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) were recorded based on a comparison between AI outputs and the reference standard. These counts formed the basis for calculating diagnostic performance indicators.
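A minimal sketch of this tally for one diagnostic category is given below. It assumes the AI outputs and the consensus reference are available as paired 0/1 labels, one pair per radiograph. The function name `confusion_counts` and the five example labels are hypothetical and are not study data.

```python
from typing import Dict, Sequence


def confusion_counts(ai: Sequence[int], reference: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives and negatives from paired binary labels."""
    if len(ai) != len(reference):
        raise ValueError("ai and reference must have the same length")
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for ai_label, ref_label in zip(ai, reference):
        if ai_label and ref_label:
            counts["TP"] += 1  # AI positive, reference positive
        elif ai_label:
            counts["FP"] += 1  # AI positive, reference negative
        elif ref_label:
            counts["FN"] += 1  # AI negative, reference positive
        else:
            counts["TN"] += 1  # both negative
    return counts


# Hypothetical labels for five radiographs in one category.
ai_dc = [1, 1, 0, 0, 1]
ref_dc = [1, 0, 0, 1, 1]
counts_dc = confusion_counts(ai_dc, ref_dc)
print(counts_dc)
```

Because each category is tallied separately, the same function would be called once per condition rather than on pooled labels.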
Statistical Analysis
Statistical analysis was performed using Statistical Package for the Social Sciences (SPSS) version 26.0 (IBM Corp., USA). For each diagnostic category, sensitivity, specificity, precision, accuracy, and F1-score were calculated along with their corresponding 95% confidence intervals. Agreement between the two radiologists, and between the AI system and the reference standard, was assessed using Cohen’s kappa. To account for class imbalance, prevalence-adjusted bias-adjusted kappa (PABAK) was also computed. Receiver operating characteristic (ROC) analysis and area under the curve (AUC) estimation were not performed, as the AI system provided only binary outputs without continuous probability scores. A two-sided p value of less than .05 was considered statistically significant.
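For readers who want to reproduce the listed indicators outside SPSS, the sketch below computes them from the four counts of a 2 × 2 table. The text does not state which confidence interval method was used, so the Wilson score interval here is an assumption chosen for illustration. The function names and the example counts are hypothetical and are not study results.

```python
import math
from typing import Dict, Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if total == 0:
        return (math.nan, math.nan)
    p = successes / total
    denom = 1.0 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def diagnostic_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard indicators computed from one category's 2 x 2 counts."""
    return {
        "sensitivity": safe_ratio(tp, tp + fn),
        "specificity": safe_ratio(tn, tn + fp),
        "precision": safe_ratio(tp, tp + fp),
        "accuracy": safe_ratio(tp + tn, tp + tn + fp + fn),
        "f1": safe_ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical counts for one category over 100 radiographs (not study data).
metrics = diagnostic_metrics(tp=18, tn=70, fp=2, fn=10)
sensitivity_ci = wilson_interval(successes=18, total=18 + 10)
print(metrics)
print(sensitivity_ci)
```

With few positive cases the interval for sensitivity is wide even when the point estimate looks acceptable, which is one reason the confidence intervals matter for the less prevalent categories.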
Results
A total of 100 panoramic radiographs were analyzed using the AI system and independently interpreted by expert radiologists. The diagnostic performance of the AI system was evaluated for each predefined dental condition using sensitivity, specificity, precision, accuracy, and F1-score, with corresponding 95% confidence intervals.
Overall agreement between AI and expert interpretation exceeded 90% for DC, RS, RC_RE, MT, and FiP (Table 1). High accuracy was primarily observed for conditions characterized by well-defined radiographic features, such as MT and FiP. For these categories, both sensitivity and specificity values were consistently high, indicating reliable detection and low rates of misclassification.
Table 1. Overall Agreement Between Artificial Intelligence (AI) and Expert Interpretation.
DC, RS, and RC_RE also demonstrated favorable diagnostic performance, with sensitivity and specificity values exceeding acceptable clinical thresholds. The F1-scores for these conditions reflected balanced performance between precision and recall. Detailed numerical values, including 95% confidence intervals, are presented in Table 2.
Table 2. Diagnostic Performance of Artificial Intelligence (AI) Compared with Reference Standard.
In contrast, PP showed comparatively lower diagnostic sensitivity and accounted for the highest number of FN findings (Table 3). Although specificity for this category remained high, the reduced sensitivity indicates that a proportion of periapical lesions detected by expert radiologists were not identified by the AI system. These FN findings represent a clinically relevant limitation, as delayed detection of PP may affect treatment planning and disease progression.
Table 3. Cases in Which Artificial Intelligence (AI) Failed to Detect Expert-confirmed Findings (FN).
FP findings, in which the AI system detected abnormalities not confirmed by manual interpretation, were observed across multiple categories, particularly in DC and restorative assessments (Table 4). While the overall frequency of FP was limited, their presence may lead to unnecessary clinical investigation if AI outputs are interpreted without professional verification.
Table 4. Cases in Which Artificial Intelligence (AI) Detected Findings Not Confirmed by Experts (FP).
Agreement analysis demonstrated high observed agreement between AI and the reference standard; however, Cohen’s kappa values varied across diagnostic categories (Table 2). Lower kappa values were observed in categories with marked class imbalance, indicating that high agreement was partly influenced by prevalence effects. The discrepancy between high observed agreement and relatively low Cohen’s kappa values can be explained by the “kappa paradox,” a statistical phenomenon observed in datasets with class imbalance. When the prevalence of a condition is either very high or very low, agreement due to chance increases, resulting in lower kappa values despite high observed agreement. In the present study, several diagnostic categories demonstrated an imbalance between positive and negative cases, which influenced the kappa statistics. Therefore, kappa values should be interpreted alongside PABAK and other diagnostic performance measures.
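The prevalence effect described above can be illustrated with two hypothetical 2 × 2 tables that share the same observed agreement. The sketch uses the standard definitions of Cohen's kappa, (p_o − p_e)/(1 − p_e), and of PABAK for two categories, 2p_o − 1. The counts are invented for illustration and are not study data.

```python
import math


def cohen_kappa(tp: int, tn: int, fp: int, fn: int) -> float:
    """Cohen's kappa for a 2 x 2 agreement table."""
    n = tp + tn + fp + fn
    if n == 0:
        return math.nan
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of both raters.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return math.nan
    return (observed - expected) / (1.0 - expected)


def pabak(tp: int, tn: int, fp: int, fn: int) -> float:
    """Prevalence-adjusted bias-adjusted kappa for two categories: 2 * p_o - 1."""
    n = tp + tn + fp + fn
    if n == 0:
        return math.nan
    return 2.0 * (tp + tn) / n - 1.0


# Hypothetical tables, both with 95 of 100 radiographs in agreement.
balanced = dict(tp=47, tn=48, fp=2, fn=3)    # about half the cases positive
imbalanced = dict(tp=2, tn=93, fp=2, fn=3)   # positives are rare

for name, table in (("balanced", balanced), ("imbalanced", imbalanced)):
    print(name, round(cohen_kappa(**table), 3), round(pabak(**table), 3))
```

Both tables have 95% observed agreement and a PABAK of 0.90, yet kappa is 0.90 for the balanced table and about 0.42 for the imbalanced one, because chance agreement rises from 0.50 to 0.914 when positives are rare.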
ROC analysis and AUC estimation were not performed, as the AI system generated binary outputs without continuous probability scores. Consequently, tables and figures related to ROC and AUC analysis were excluded from the final analysis.
A summary of the overall agreement between AI and expert interpretation is presented in Table 1. The distribution of FN and FP findings is detailed in Tables 3 and 4, respectively. Comprehensive diagnostic performance metrics with confidence intervals are provided in Table 2.
Discussion
The present study evaluated the diagnostic performance of an AI system for detecting common dental conditions on panoramic radiographs by comparison with expert manual interpretation. The findings demonstrate that the AI system achieved high performance for conditions characterized by distinct radiographic features, particularly MT and FiP, as well as favorable accuracy for DC, RS, and RC_RE. Similar results have been reported in previous studies evaluating AI-assisted interpretation of panoramic radiographs, which demonstrated strong performance in identifying well-defined dental structures.4, 7 These findings suggest that AI-assisted interpretation may enhance workflow efficiency and support clinical decision-making in appropriately selected clinical settings. 8
In contrast, comparatively lower sensitivity was observed for PP, indicating reduced detection of subtle lesions located in anatomically complex regions. 9 This limitation is consistent with earlier investigations reporting reduced accuracy of AI systems in detecting early or small periapical lesions, particularly on two-dimensional radiographs.10, 11 Although specificity for this category remained high, the increased proportion of FN findings is clinically relevant, as undetected periapical disease may delay appropriate endodontic or periodontal intervention. These findings reinforce the need for continued clinician oversight when interpreting AI-generated outputs.12, 13
The discrepancy between high observed agreement and relatively low Cohen’s kappa values reflects the influence of class imbalance and prevalence effects, commonly referred to as the “kappa paradox.” Similar statistical patterns have been reported in previous diagnostic accuracy studies involving imbalanced datasets.1–3 Consequently, agreement measures should be interpreted in conjunction with prevalence-adjusted metrics and complementary performance indicators.14, 15
Comparison with published literature demonstrates general consistency with studies reporting favorable diagnostic performance of deep learning models in dental imaging.4, 7, 16 However, most available studies, including the present investigation, rely on curated datasets with optimal image quality. Systematic reviews and meta-analyses have emphasized that such methodological characteristics may lead to overestimation of real-world performance.17–19
Furthermore, the use of per-radiograph analysis rather than per-tooth evaluation may have inflated the diagnostic performance metrics. While this approach improves feasibility and reflects real-world screening scenarios, it reduces spatial diagnostic precision and may obscure localized errors. Additionally, the inclusion of only high-quality panoramic radiographs may have introduced selection bias, limiting applicability to routine clinical practice where image quality is variable. The reference standard, based on expert consensus without adjunctive CBCT or clinical validation, may also have introduced diagnostic uncertainty, particularly for subtle lesions. Future studies should prioritize tooth-level, lesion-level, or region-specific analysis, include diverse image qualities, and utilize multimodal validation to improve clinical applicability.8–10
The strengths of the present study include the use of a standardized reference protocol with blinded expert evaluation, calibration of radiologists prior to assessment, and the evaluation of multiple clinically relevant dental conditions within the same dataset. Additionally, the use of real-world panoramic radiographs enhances the clinical relevance of the findings compared to purely experimental datasets.
AI-assisted interpretation of panoramic radiographs may have important clinical applications in screening programs, teledentistry, and educational settings. In resource-limited environments, AI systems may support early detection of dental diseases and reduce dependence on specialist interpretation. Additionally, AI tools may serve as training aids for dental students by providing immediate feedback and standardized diagnostic suggestions. Overall, the findings of the present study support existing evidence indicating that AI systems may serve as useful adjunctive tools in oral radiology, particularly for detecting well-defined abnormalities.4, 7, 8 Nevertheless, persistent challenges in identifying subtle pathologies underscore the need for cautious clinical implementation and continued algorithmic refinement.6, 11, 13 Future research should focus on large-scale multicenter validation, integration of multimodal imaging data, and development of explainable AI systems to improve clinical trust and diagnostic transparency.
Conclusion
AI demonstrated high diagnostic performance for well-defined dental conditions on panoramic radiographs; however, reduced sensitivity for subtle pathologies and methodological limitations necessitate cautious clinical implementation and further multicenter validation.
Footnotes
Acknowledgements
The authors would like to express their sincere gratitude to Velmeni (USA) for providing access to the “Second Dentist” software. The authors are also grateful to Rishiraj College of Dental Sciences and Research Centre for providing them with the resources and support they needed to complete this project.
Declaration of Conflict of Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical Approval and Informed Consent
Institutional Ethics Committee (IEC) approval was obtained prior to study commencement. Due to the retrospective design, informed consent was waived.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
