Abstract
Deep learning continues to advance imaging-based diagnosis in oral and maxillofacial radiology. This narrative review synthesizes recent deep learning applications for detecting, classifying, and segmenting jaw cystic lesions and maxillofacial tumors on panoramic radiographs and cone-beam computed tomography scans. It summarizes representative one-stage detectors and convolutional neural network/transformer-based classifiers, along with segmentation methods, reported performance metrics, and key use-case considerations. Beyond this synthesis, the review critically examines dataset constraints, spectrum and site bias, device-related heterogeneity, annotation inconsistency, and gaps in model explainability, and describes how these limitations restrict generalizability. Practical considerations for clinical implementation are also discussed, including workflow placement, quality assurance, and governance, followed by emerging research directions such as federated learning, multimodal fusion, and radiomics–deep learning combinations, each evaluated in terms of feasibility and current evidence maturity. Key evaluation metrics are interpreted in the context of dental imaging. Overall, current findings suggest that deep learning may enhance early and consistent recognition of jaw lesions, support surgical planning through automated delineation, and promote standardized interpretation, provided that models undergo external validation, reporting remains transparent, and deployment is guided by appropriate clinical oversight.
Introduction
Odontogenic cysts and tumors of the jaws constitute a substantial portion of oral and maxillofacial pathologies and are second in prevalence only to dental impactions. 1 These lesions, which include periapical cysts, dentigerous cysts, odontogenic keratocysts (OKCs), and benign tumors such as ameloblastomas, often progress insidiously, with many remaining asymptomatic until they enlarge sufficiently to cause swelling, tooth displacement, or even pathologic fracture. 2 Early and accurate diagnosis is essential because management varies considerably by lesion type; for instance, an OKC is typically treated with conservative enucleation, whereas an ameloblastoma frequently requires more extensive resection due to its aggressive nature. Misdiagnosis is a common clinical challenge because these lesions can appear radiographically similar on routine examinations.3–5 For instance, both OKCs and ameloblastomas often present as radiolucent jaw lesions, and distinguishing them on a panoramic radiograph (PR) can be challenging even for experienced clinicians. 6 In clinical practice, such diagnostic errors may result in inappropriate treatment—either overtreatment of a cystic lesion or insufficient surgery for a tumor—with substantial implications for patient outcomes.7,8
Diagnostic imaging is essential for the early detection and characterization of jaw lesions. PR is widely used as an initial screening modality in dentistry and can reveal asymptomatic radiolucent lesions during routine examinations.9–11 However, interpretation of PR is often hindered by overlapping anatomical structures and projection distortions that may obscure true pathology or create misleading appearances. Cone-beam computed tomography (CBCT) is increasingly employed for three-dimensional assessment of maxillofacial pathology and provides improved visualization of lesion boundaries, internal features, and proximity to critical anatomical structures.12,13 CBCT can identify subtle osseous changes that may not be apparent on PR and offers multiplanar views that aid in surgical planning.14,15 Although advanced imaging techniques such as multidetector computed tomography (CT) and magnetic resonance imaging (MRI) are used for large tumors, suspected malignancy, and soft-tissue involvement, PR and CBCT remain the primary imaging modalities for most odontogenic lesions.
Despite the availability of these imaging tools, challenges remain in consistently diagnosing jaw cysts and tumors. Radiographic features are often equivocal, and image interpretation is subject to considerable interobserver variability. 16 Even specialists may disagree on whether a radiolucency represents an OKC or ameloblastoma without biopsy confirmation. In this context, artificial intelligence (AI), particularly deep learning (DL), has the potential to improve diagnostic accuracy and efficiency. DL, a subset of machine learning that uses multilayered artificial neural networks, has achieved notable success in medical image analysis over the past decade. Convolutional neural networks (CNNs), which are loosely modeled on the human visual cortex, can automatically learn complex imaging patterns. In fields such as radiology and pathology, they have achieved expert-level performance in detecting abnormalities when trained on large annotated datasets.17–19 The oral and maxillofacial imaging field has similarly experienced rapid growth in DL applications using dental radiographs and CBCT scans for various tasks such as caries detection and cephalometric landmark identification, with promising outcomes.20,21 Several studies have also investigated CNN-based approaches for identifying and classifying jaw lesions on PRs. 7 Early findings indicate that AI may assist clinicians by highlighting potential lesions and suggesting likely diagnoses, thereby supporting clinical decision-making.
This narrative review critically examines recent DL applications for imaging-based diagnosis of jaw cystic lesions, with a focus on methodological rigor, clinical relevance, and existing evidence gaps. It is intended as a reference for dental researchers, radiologists, and clinicians seeking an overview of current AI approaches in jaw lesion diagnosis and aims to identify opportunities for future improvements that may enhance patient safety and treatment effectiveness.
Methods
This review was conducted following the Scale for the Assessment of Narrative Review Articles (SANRA) guidelines for narrative reviews. 22 Literature searches were performed in PubMed and Scopus from January 2015 to August 2025 using combinations of the following keywords: (“jaw” OR “maxillofacial”) AND (cyst* OR tumor* OR lesion*) AND (panoramic OR orthopantomogram OR CBCT) AND (deep learning OR CNN OR transformer OR segmentation OR detection OR classification). Studies were included if they applied DL to PR or CBCT for the detection, classification, or segmentation of jaw lesions or tumors. Nonimaging studies, purely methodological investigations, and studies not involving DL were excluded. Two authors independently screened titles and abstracts, with disagreements resolved through discussion. Extracted data included imaging modality, task, model architecture, dataset size, validation strategy (including external validation if reported), and performance metrics. This review emphasizes critical appraisal and clinical implications and does not aim to provide a fully systematic summary of all published studies in this field.
DL applications in jaw cystic lesions
Recent research has proposed a range of DL models designed to detect and diagnose odontogenic cysts and related jaw lesions on imaging. The term “jaw cystic lesions” generally encompasses common entities such as radicular cysts, dentigerous cysts, OKCs, nasopalatine duct cysts, and simple bone cysts, many of which appear as radiolucent areas in the jaws. Because these lesions are often radiographically similar, AI faces a twofold task: first, detecting the presence or location of any lesion on an image (object detection), and second, classifying the lesion into the appropriate type (diagnosis). Some studies also incorporate a segmentation step to delineate lesion boundaries for visualization or volumetric analysis. Notable studies and their approaches to these tasks are summarized below. 23
Object detection and localization of jaw lesions
Localizing a lesion on a PR can be challenging because the image is large (often ∼3000 × 1500 pixels) and the lesion may occupy only a small region. 24 Modern one-stage detectors, such as the YOLO family and RetinaNet, have largely replaced sliding-window approaches. Across representative studies, high box-level performance has been reported; for example, a YOLOv3 model trained on 1282 PRs achieved ∼0.87 precision for lesion boxes, 25 and subsequent YOLO iterations for mandibular radiolucencies reported ∼0.95 precision with ∼0.94 recall following data augmentation. 6 These results indicate that the models rarely miss lesions and generate few false alarms on the test sets, demonstrating strong performance for automated radiographic analysis.
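To make the box-level precision and recall figures above concrete, the following is a minimal illustrative sketch of how a detected box is matched to ground truth at an Intersection over Union (IoU) threshold; all boxes and values are hypothetical toy data, not drawn from the cited studies.

```python
# Illustrative sketch: box-level precision/recall at an IoU threshold,
# as commonly reported for one-stage detectors (boxes are hypothetical).

def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(predictions, ground_truth, iou_thr=0.5):
    """Greedy one-to-one matching of predicted to ground-truth boxes."""
    matched = set()
    tp = 0
    for pred in predictions:
        for i, gt_box in enumerate(ground_truth):
            if i not in matched and iou(pred, gt_box) >= iou_thr:
                matched.add(i)
                tp += 1
                break
    fp = len(predictions) - tp   # unmatched predictions: false alarms
    fn = len(ground_truth) - tp  # unmatched lesions: misses
    return tp / (tp + fp), tp / (tp + fn)

# Toy example: two ground-truth lesions, one accurate and one spurious prediction
gt = [(100, 100, 200, 200), (400, 300, 500, 380)]
preds = [(110, 105, 205, 210), (600, 600, 650, 650)]
prec, rec = precision_recall(preds, gt)  # 0.5 precision, 0.5 recall here
```

Average precision (AP), as reported in the detection studies above, extends this matching by sweeping the model's confidence threshold and integrating precision over recall.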
YOLO’s advantage lies in its real-time detection capability; in a previous study, YOLO evaluated a batch of 181 panoramic test images in real time, whereas human experts required over half an hour. 26 Two-stage pipelines, such as a detector followed by U-Net, remain useful when precise contour delineation is needed, but one-stage models typically provide simpler, near-real-time triage.27,28 The choice between two-stage and one-stage approaches is task-dependent: when the clinical goal is to flag any potential lesion for secondary review, a robust one-stage detector is often sufficient; when planning requires lesion shape or extent, a subsequent segmentation step adds value. Tajima et al. 29 validated YOLOv2 on small datasets, achieving 84.0% sensitivity and 85.8% specificity for cyst-like radiolucencies, demonstrating that optimized small-sample training can mitigate performance loss.
Yang et al.’s YOLOv2 model, trained on 1603 PRs, outperformed human clinicians in precision (70.7%) and recall (68.0%), with a diagnostic accuracy of 66.3%, comparable to that of oral surgeons, indicating that AI can efficiently approach expert-level detection. 26 However, most studies are single‑center and rely on per‑lesion average precision (AP) rather than per‑patient outcomes; dataset splits are sometimes performed at the image level rather than the patient level, risking data leakage. Domain shift due to different vendors, acquisition parameters, or metal artifacts is rarely evaluated. Few studies provide probability calibration or decision‑curve/net‑benefit analyses to determine whether alerts aid clinicians. Consequently, high AP values reflect technical capability under controlled conditions rather than clinic‑ready performance.
For clinical translation, detection studies should (a) report patient‑level sensitivity, specificity, and time saved alongside per‑lesion metrics; (b) include multicenter external validation with scanner and vendor stratification; (c) provide reliability plots/expected calibration error and specify operating thresholds for reported claims; and (d) quantify the review burden of false positives. Incorporating these elements can convert strong technical performance into interpretable value for triage and worklist prioritization in PR evaluation.
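The expected calibration error (ECE) mentioned above can be computed with a simple binning scheme: predictions are grouped by confidence, and the gap between mean confidence and observed event frequency is averaged across bins. The sketch below uses hypothetical probabilities and labels purely for illustration.

```python
# Minimal sketch of expected calibration error (ECE) with equal-width bins.
# Probabilities and labels below are hypothetical illustration values.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=5):
    """ECE: bin-weighted gap between predicted probability and observed frequency."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of cases in the bin
    return ece

# Ten alerts issued at 0.9 confidence, of which 9 were true lesions:
# confidence matches observed frequency, so ECE is ~0 (well calibrated).
probs = [0.9] * 10
labels = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
ece = expected_calibration_error(probs, labels)
```

A reliability plot is the visual counterpart of this computation: per-bin observed frequency plotted against per-bin mean confidence, with deviations from the diagonal indicating miscalibration.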
Classification and diagnosis of lesion types
After lesion detection, the subsequent task is determining lesion type. Many DL studies have focused on classifying jaw lesions into diagnostic categories using either entire images or localized regions containing the lesion. 30 CNN classifiers were among the earliest approaches, typically requiring the lesion to be approximately centered in the image or provided as input. For example, Poedjiastoeti et al. 31 adapted the Visual Geometry Group (VGG)-16 model on ∼400 PR crops to differentiate ameloblastomas from OKCs with high screening accuracy. Using larger cohorts and more advanced backbones, Lee et al. 32 applied Inception-v3 to combined PR/CBCT inputs to distinguish OKCs, dentigerous cysts, and periapical cysts, reporting ∼80%–90% accuracy. Transfer learning and data augmentation consistently enhance performance; a simple CNN trained from scratch that achieves ∼78% accuracy can surpass 90% with pretraining and robust augmentation on small datasets.33,34 This improvement highlights that pretraining on large datasets, even of nonmedical images, provides networks with general feature-extraction capabilities that are valuable when data are limited. Analytically, strong results reported using mixed PR and CBCT inputs should be interpreted cautiously, as region of interest (ROI) pre‑selection and cross-modality “shortcuts” can artificially inflate apparent generalization. For clinically asymmetric risks, such as missing an OKC, cost‑sensitive thresholds and an “uncertain—refer” option are preferable to forced single‑label outputs. Studies should routinely report macro/micro-F1 scores, Cohen’s κ, confusion matrices, and calibration to support clinical interpretation. Data augmentation remains equally critical; for instance, Kwon et al. 25 expanded their training set 12-fold using flips, rotations, and other transformations, which significantly improved the YOLO model’s sensitivity and specificity for jaw lesion detection, emphasizing augmentation’s role in mitigating class imbalance and small sample sizes.
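The kind of geometric augmentation described above can be sketched in a few lines; this minimal example produces an 8-fold expansion from flips and 90-degree rotations (real pipelines, such as the 12-fold scheme cited, typically add small-angle rotations, scaling, and intensity jitter as well).

```python
# Sketch of simple geometric augmentation for a radiograph crop.
# A stand-in array is used here; values and shapes are illustrative only.
import numpy as np

def augment(image):
    """Return flipped/rotated variants of the input (8 variants total)."""
    variants = []
    for k in range(4):                       # 0, 90, 180, 270 degree rotations
        rotated = np.rot90(image, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # plus a horizontal flip of each
    return variants

toy = np.arange(12).reshape(3, 4)  # stand-in for a cropped lesion region
augmented = augment(toy)
n_variants = len(augmented)        # 8-fold expansion in this minimal sketch
```

Note that augmentation of this kind perturbs existing cases; as discussed later in this review, it cannot substitute for genuinely diverse pathology.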
Multiple studies have investigated multiclass classification of jaw lesions, which is more challenging than binary classification due to overlapping radiographic features. A meta-analysis by Shoorgashti et al. 35 reported that AI models for OKC detection achieved an overall sensitivity of 83.7% and specificity of 82.9%, with YOLO-based models reaching 96.4% sensitivity and 96.0% specificity, demonstrating their effectiveness on real-world radiographs. Fedato et al. 36 similarly highlighted AI’s strong diagnostic capability for odontogenic lesions while emphasizing study heterogeneity and the need for standardized evaluation methods. Some studies reported area under the curve values as high as 0.95 for specific cyst types, whereas others observed 0.70–0.80 accuracy in more complex scenarios. Overall, AI models trained on high-quality datasets can achieve classification accuracy exceeding 80%–90% for jaw cysts.
A two‑branch CNN achieved an average accuracy of 88.7% across four categories (dentigerous cyst, periapical cyst, OKC, and ameloblastoma), with a mean sensitivity of ∼66.6% and higher specificity of ∼92.7%; when simplified to lesion-versus-healthy classification, the accuracy increased to ∼90.7%. 7 Cascade designs that first detect and then classify, such as MobileNetv2 + YOLOv3, outperform classification-only baselines for apical radiolucency subtyping.25,26,37 These studies reported improved performance using the two-stage approach compared with classification alone, highlighting the synergistic effect of detection and classification. Although precision and recall values were not explicitly stated in all reports, the accuracy gains attributed to the cascaded designs suggest strong performance on the respective test sets.
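The multiclass metrics recommended earlier (macro-F1, Cohen's κ) can be derived directly from a confusion matrix. The sketch below uses a purely hypothetical 3 × 3 matrix to show the computations.

```python
# Illustrative computation of macro-F1 and Cohen's kappa from a confusion
# matrix; the 3x3 matrix below is hypothetical, not from any cited study.
import numpy as np

def macro_f1(cm):
    """Unweighted mean of per-class F1 scores (rows: true, cols: predicted)."""
    cm = np.asarray(cm, dtype=float)
    f1s = []
    for c in range(cm.shape[0]):
        tp = cm[c, c]
        fp = cm[:, c].sum() - tp
        fn = cm[c, :].sum() - tp
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

def cohens_kappa(cm):
    """Agreement beyond chance between predictions and ground truth."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    po = np.trace(cm) / n                        # observed agreement
    pe = (cm.sum(0) * cm.sum(1)).sum() / n ** 2  # chance agreement
    return (po - pe) / (1 - pe)

cm = [[40, 5, 5],   # e.g., rows could be true lesion classes,
      [4, 30, 6],   # columns the predicted classes
      [2, 3, 25]]
f1 = macro_f1(cm)
kappa = cohens_kappa(cm)
```

Because macro-F1 weights every class equally, it penalizes models that neglect rare lesion types, which plain accuracy can hide under class imbalance.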
Ensemble and hybrid models have also been investigated. Liu et al. 38 proposed a hybrid VGG-19/ResNet-50 model for ameloblastoma–OKC classification; however, its performance was not directly compared with single VGG-19 or ResNet-50 models, leaving it unclear whether the ensemble design offered advantages beyond individual architectures. These innovative approaches illustrate the field’s evolution from using off-the-shelf CNNs toward developing task-specific networks or hybrid combinations that more effectively capture the nuances of jaw imaging (Figure 1).

Illustration of the deep learning (DL) training process for jaw cystic lesion recognition. DL: deep learning.
Segmentation of cystic lesions
Segmentation, which involves delineating lesion boundaries, is less frequently the primary objective but is included as a component in several studies. 39 Accurate segmentation of a jaw cyst can provide precise information on its size, shape, and volume, which is clinically valuable for surgical planning and follow-up. Although only a few studies employed DL exclusively for jaw lesion segmentation, many incorporated segmentation following detection or for visualization purposes.40–42
The U-Net architecture is the predominant model for medical image segmentation due to its encoder–decoder design, which enables precise localization while preserving contextual information. In jaw imaging, U-Net and its variants have demonstrated strong performance even with limited datasets by leveraging data augmentation and pretraining.42,43 Kirnbauer et al. 44 proposed a two-step approach for periapical lesion analysis on CBCT: first, the tooth and relevant region were identified using a Spatial Configuration-Net, followed by binary segmentation of the lesion using an improved U-Net. This method achieved 97.1% sensitivity and 88.0% specificity for lesion detection on CBCT and reported a high mean Dice coefficient, reflecting overlap between AI segmentation and ground truth. These results indicate that once the ROI was located, the U-Net accurately delineated lesions on CBCT slices. Furthermore, Kirnbauer’s pipeline achieved a “successful diagnosis rate” of up to 97% for dental localization, demonstrating that the method rarely missed lesions when present.
A notable segmentation-focused study by Xu et al. employed a Mask Region-based CNN (R-CNN) to automatically segment ameloblastomas on CT images. 16 Despite a limited training set of 79 cases, extensive data augmentation and cross-validation were applied. The model achieved a Dice coefficient of 0.874 for ameloblastoma volume delineation, indicating high segmentation accuracy. Detection performance, evaluated using AP at an Intersection over Union (IoU) threshold of 0.5, was 91.4%, showing that the model correctly identified lesion regions in the majority of cases. Importantly, external validation was performed on 200 CT images from a separate center, demonstrating strong generalization and providing confidence that the model’s performance is not restricted to the original scanner or patient population.
Mask R-CNN, as applied by Xu et al. and Yeshua et al., is an effective instance segmentation framework that generates both bounding boxes and pixel-level masks.16,45 Yeshua et al. employed the model on 3D CBCT data to detect maxillofacial bone lesions, achieving a per-slice detection sensitivity of 95.9% and precision of 98.9%, with a 3D segmentation Dice coefficient of 83.5%. The high precision reflects minimal false positives, enabling accurate computation of lesion volumes, which supports diagnosis and follow-up. The Dice scores are consistent with Xu et al.’s results for ameloblastomas, demonstrating Mask R-CNN’s reliable performance in jaw lesion segmentation when adequate training data are available.
An innovative variation on U-Net is the Dense U-Net with anatomical constraints. Zheng et al. 30 introduced an anatomically constrained Dense U-Net that incorporated oral anatomical knowledge into the segmentation process. This approach allowed good performance even with a small training dataset, outperforming a standard Dense U-Net in both detection accuracy and Dice coefficient by leveraging known anatomical constraints. The study suggests that integrating domain knowledge with DL can guide models and reduce errors that violate anatomical plausibility. However, the Dice coefficient alone may conceal boundary inaccuracies that could impact surgical planning. Studies should also report metrics such as the 95th-percentile Hausdorff distance (HD95) and relative volume error, as 2D stacking may compromise 3D topological consistency. Direct 3D architectures or the inclusion of shape and anatomical priors may yield more reliable volumetric outputs. Additionally, external validation cohorts remain limited, and linking segmentation accuracy to downstream clinical outcomes, such as operative windows or recurrence, would enhance clinical relevance.
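The point that Dice alone can conceal clinically relevant errors is easy to demonstrate: the sketch below, on hypothetical binary masks, pairs Dice with relative volume error and shows that a seemingly high Dice score can coexist with substantial volume underestimation.

```python
# Sketch of complementary segmentation metrics on hypothetical 2D masks
# (the same definitions apply voxel-wise in 3D). HD95 additionally needs
# boundary distance transforms and is omitted here for brevity.
import numpy as np

def dice(pred, gt):
    """Dice coefficient: overlap relative to the mean mask size."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def relative_volume_error(pred, gt):
    """Signed volume error; negative values indicate under-segmentation."""
    return (pred.sum() - gt.sum()) / gt.sum()

gt = np.zeros((10, 10), dtype=bool)
gt[2:8, 2:8] = True        # "lesion" of 36 pixels
pred = np.zeros((10, 10), dtype=bool)
pred[3:8, 2:8] = True      # prediction misses one boundary row: 30 pixels
d = dice(pred, gt)
rve = relative_volume_error(pred, gt)
```

Here Dice is ∼0.91, yet the predicted volume is ∼17% too small, illustrating why volumetric and boundary metrics should accompany Dice in surgical-planning contexts.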
Collectively, applications in jaw cystic lesions demonstrate that DL can (a) flag radiographs or CBCT scans containing a lesion as a screening aid, (b) suggest the likely diagnosis for decision support, and (c) delineate the lesion for measurement and visualization to assist surgical planning (Figure 2). The synergy of detection, classification, and segmentation is evident: studies combining these tasks often report that each step facilitates the others. For instance, performing segmentation after detection can improve classification accuracy by focusing analysis on the lesion region, while knowledge of the lesion class can, in turn, enhance segmentation performance.

Demonstration of a DL model for jaw cystic lesion recognition. DL: deep learning.
Sources of bias and external validity
Reported diagnostic performance can be influenced by multiple sources of bias, including spectrum and site bias from single-center data, device heterogeneity due to different scanners or acquisition settings, class imbalance, and variability in expert annotations. Small datasets and the absence of external validation further increase the risk of overfitting and inflated performance metrics. To contextualize the results, studies should report detailed cohort characteristics, conduct cross-center evaluations, and include uncertainty estimates where feasible. These considerations are crucial for assessing the readiness of models for clinical deployment.
Clinical integration and limitations
For clinical adoption, three questions are central: (1) Will the model reduce missed lesions without generating an unacceptable number of false alarms? (2) Does it save net time in PR/CBCT reading under calibrated thresholds? (3) Is performance stable across scanners, sites, and patient subgroups after deployment? Building on these considerations, limitations and potential remedies can be organized into five areas: data diversity (multi‑center curation), robustness (external validation and drift monitoring), decision‑making (probability calibration, cost‑sensitive thresholds, and an “uncertain—refer” option), explainability aligned with radiographic signs, and governance (quality assurance, privacy, and regulatory oversight).
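The cost-sensitive, abstaining decision rule referred to above can be sketched as a simple mapping from a calibrated lesion probability to a worklist action; the threshold values here are hypothetical placeholders that would in practice be set from decision-curve or net-benefit analysis, with the flagging threshold lowered when missing a lesion carries the higher cost.

```python
# Sketch of a calibrated triage rule with an abstention band.
# Thresholds are hypothetical; real operating points should be derived
# from decision-curve analysis on the target population.

def triage(prob_lesion, lower=0.30, upper=0.70):
    """Map a calibrated lesion probability to a worklist action."""
    if prob_lesion >= upper:
        return "flag for priority review"
    if prob_lesion <= lower:
        return "routine reading"
    return "uncertain - refer"  # abstain rather than force a single label

decisions = [triage(p) for p in (0.05, 0.50, 0.95)]
```

The width of the abstention band trades reader workload against the risk of over-confident single-label outputs, and should itself be audited after deployment.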
Limited data availability and class imbalance
A key limitation in jaw lesion research is the scarcity of large, diverse datasets. Many institutions encounter only a few cases of specific cysts or tumors annually, and assembling thousands of annotated images typically requires multicenter collaboration. As noted, half of the reviewed studies included fewer than 500 images.16,29,31 Models trained on such small datasets are prone to overfitting, performing well on seen cases but poorly on new patients. Variations in study inclusion criteria further complicate generalization. The issue of class imbalance is closely related: rarer lesions, such as Stafne bone cysts or central giant cell granulomas, may be underrepresented, causing models to favor more common classes.46,47 Data augmentation partially mitigates this by synthetically increasing minority-class samples, but it cannot introduce truly new pathology patterns and only perturbs existing ones. Shi et al. 30 observed category imbalance in many datasets and highlighted augmentation as a frequent remedy. Although augmentation improves model robustness in some cases, it does not replace the need for truly diverse data.
Generalizability and external validation
Generalizability refers to a model’s performance on data outside its training distribution, such as images acquired with different equipment, settings, or populations. Only a few studies have conducted rigorous external validation. Yeshua et al. 45 evaluated their Mask R-CNN on a separate cohort and maintained high Dice and detection metrics. Xu et al. 16 tested their model on CT scans from another center, confirming robustness. Although these results are encouraging, additional external validation studies are needed. Publication bias further complicates assessment: studies reporting favorable results are more likely to be published, whereas those with poor generalization may remain unpublished, potentially skewing perceptions of AI performance. In Shoorgashti et al.’s meta-analysis, Egger’s test indicated possible publication bias (p = 0.042), suggesting that aggregated performance metrics may overestimate the capabilities of an unbiased average model. 35
Lack of explainability
Current DL models often function as black boxes. For many clinicians, especially in fields such as surgery or radiology where nuanced interpretation can influence management, obtaining a result without an explanatory rationale can be unsettling. For instance, an AI system may label a lesion as “OKC with 90% confidence,” but a surgeon would want to understand the basis for this prediction—did the model recognize features such as a scalloped border or minimal expansion, or was the decision influenced by irrelevant factors like image artifacts? Trust in such outputs is difficult to establish without explanation.
Efforts to improve interpretability include visualization tools such as Gradient-weighted Class Activation Mapping (Grad-CAM), a CNN technique that requires no model modification or retraining. Grad-CAM produces a heatmap in which warmer colors indicate regions most influential to the model’s decision, while cooler colors correspond to less relevant areas. This approach helps bridge the “black box” gap, allowing clinicians to verify that the model focuses on meaningful clinical features rather than artifacts when diagnosing jaw lesions. 48 Some studies applied Grad-CAM to confirm that the CNN concentrated on the lesion area rather than extraneous regions during classification. 49 However, these heatmaps have limitations: if the model misclassifies, the highlighted region may be off-target or misleading, and Grad-CAM indicates where the model looked but not which features it used. It cannot convey reasoning such as “the decision was based on the lesion’s scalloped margins and epicenter in the ramus,” which a radiologist would typically provide.
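The core Grad-CAM computation is compact: each channel of the last convolutional layer is weighted by the spatial average of its gradient with respect to the class score, the weighted maps are summed, and a ReLU retains only positive evidence. The sketch below uses random arrays as stand-ins for a real network's activations and gradients, purely to illustrate the arithmetic.

```python
# Conceptual sketch of the Grad-CAM computation described above.
# Arrays are hypothetical stand-ins for a CNN's last conv layer outputs.
import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: (channels, H, W) from the target conv layer."""
    weights = gradients.mean(axis=(1, 2))             # alpha_k: spatially averaged gradients
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum over channels -> (H, W)
    cam = np.maximum(cam, 0)                          # ReLU: keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1] for display
    return cam

rng = np.random.default_rng(0)
acts = rng.random((8, 7, 7))   # 8 channels of 7x7 feature maps (stand-in)
grads = rng.random((8, 7, 7))  # gradients of the class score w.r.t. them
heatmap = grad_cam(acts, grads)
```

In practice the low-resolution heatmap is upsampled to the radiograph's size and overlaid in color; note that the computation assigns spatial importance only, which is precisely why it cannot name the radiographic features involved.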
Data privacy and regulatory concerns
Medical images constitute protected health information, and sharing them for AI development raises privacy concerns. Strict regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States and the General Data Protection Regulation (GDPR) in Europe, often complicate multi-source data aggregation.50–52 This challenge motivates the use of federated learning, in which images remain on local servers and only model weights or gradients are shared, enabling collaborative training without exposing patient data. 53 Federated learning has emerged as a promising strategy in dentistry for overcoming data silos while maintaining privacy. Early studies indicate that dental AI models can be trained in a federated manner with performance close to that of traditional centralized training. 54 Nonetheless, this approach introduces additional complexity in coordination and regulatory oversight.
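The central mechanism of federated learning, aggregation of locally trained weights without sharing images, can be illustrated with a federated-averaging (FedAvg) sketch; the sites, weight vectors, and sample counts below are hypothetical toy values.

```python
# Minimal FedAvg sketch: each clinic trains locally and shares only model
# weights, which a coordinating server averages weighted by dataset size.
# All values below are hypothetical; a real system would iterate this over
# many communication rounds with genuine local training in between.
import numpy as np

def fedavg(site_weights, site_counts):
    """Average per-site weight vectors, weighted by local sample counts."""
    total = sum(site_counts)
    return sum(w * (n / total) for w, n in zip(site_weights, site_counts))

# Three clinics; only the weight vectors leave each site, never the images.
w_a = np.array([1.0, 2.0])
w_b = np.array([3.0, 4.0])
w_c = np.array([5.0, 6.0])
global_w = fedavg([w_a, w_b, w_c], site_counts=[100, 200, 100])
# Equivalent to 0.25*w_a + 0.5*w_b + 0.25*w_c
```

Even this simple scheme shows why coordination overhead grows: sites must agree on architecture, preprocessing, and round scheduling, and gradient-leakage defenses may be layered on top, which is consistent with the added regulatory complexity noted above.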
Acknowledging limitations does not diminish the achievements of AI. The ability of AI to detect small jaw lesions with high fidelity or differentiate morphologically similar cysts at a level comparable to experts demonstrates substantial research value. In total, 12 DL studies met the inclusion criteria for jaw-related imaging tasks, of which 9 primarily addressed detection and/or classification, and 5 focused on PR/CBCT-based segmentation; several studies contributed to more than one task. As part of our critical evaluation, Table 1 shows representative DL studies on jaw lesions, highlighting their core contributions and primary limitations in line with the discussions presented in this section.
Representative DL studies on jaw lesions and their key limitations.
Acc: accuracy; CBCT: cone-beam computed tomography; CNN: convolutional neural network; CT: computed tomography; DCNN: deep convolutional neural network; DL: deep learning; PR: panoramic radiograph; Prec: precision; Rec: recall; ROI: region of interest; Sens: sensitivity; Spec: specificity; YOLO: You Only Look Once; 3D: three‑dimensional.
Future directions
Short- to mid-term progress in jaw lesion imaging is likely to arise from multicenter curation of diverse datasets, transparent reporting, and externally validated models prospectively tested within clinical workflows. Promising technical approaches, including federated learning, multimodal fusion, and radiomics–DL hybrids, may enhance model robustness, but their clinical utility depends on governance, calibration, and sustained post-deployment monitoring. Explainability methods should advance beyond heatmaps toward clinically meaningful rationales aligned with dental radiology practice. Ultimately, successful integration will require regulatory compliance, attention to human factors, and demonstration of additive value compared with standard care.
Conclusion
DL demonstrates substantial potential in assisting the detection and delineation of jaw cystic lesions and maxillofacial tumors on PR and CBCT. When developed with diverse datasets and externally validated, these tools may facilitate earlier and more consistent diagnosis and inform surgical planning. Safe and effective clinical integration requires transparent reporting, appropriate governance, and prospective evaluation within real-world workflows. However, limitations of this review—including the absence of quantitative synthesis, coverage restricted to selected imaging modalities, and inconsistent study methodologies—may introduce bias, limit insights into practical application, and hinder comparison of AI architectures, highlighting the need for standardized research reporting to improve future work.
Footnotes
Acknowledgments
Prof Kaijin Hu assisted with English language polishing, limited to grammar and style; all authors reviewed and approved the final manuscript.
Author contributions
Conceptualization: B.Z., Y.L., and C.L.; Methodology and investigation: B.Z. and Y.L.; Writing—original draft: B.Z. and Y.L.; Writing—review & editing: J.S., S.L., and C.L.; Supervision: C.L.
Data availability statement
No new data were generated or analyzed in this study. Figures 1 and 2 are author-created schematic illustrations that contain no third-party copyrighted material and no identifiable patient information.
Declaration of conflicting interest
The authors declare that they have no conflicts of interest.
Funding
This work was supported by the Key Research and Development Program of Shaanxi Province-Key Industry Innovation Chain (Group)-Social Development Field under No. 2024SF-ZDCYL-01-15.
