Abstract
Vision-language models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes 1) a novel, large-scale dentistry image dataset collected from clinical, public, and web sources; 2) a semistructured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and 3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities, enabling rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging large language models (LLMs), we derive standardized benchmarks: more than 18,000 visual question answering (VQA) pairs and an 18-class multilabel classification dataset, which we validated with human review and error analysis to verify that the LLM-driven conversion reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy (e.g., less than 70% in VQA) and producing inconsistent or incomplete descriptions in image captioning. These findings underscore the gap between general-purpose VLMs and the demands of specialized clinical applications, highlighting the need for domain-adapted training and more sophisticated evaluation protocols to assist professional dental practice and community oral health efforts. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.
Introduction
Dental photography plays a crucial role in diagnosis, treatment planning, patient education, and documentation across dental disciplines (Ding et al 2023; Caron et al 2025; Mania et al 2025). In this context, recent advancements in vision-language models (VLMs) offer a compelling opportunity for automated and scalable image interpretation. By leveraging large-scale annotated datasets, VLMs have demonstrated remarkable capabilities in understanding and reasoning about visual content in natural images (Radford et al 2021; Liu et al 2023).
In health care research, VLMs have been increasingly utilized in analyzing visual and textual data, aiding professionals in detection, diagnosis, and reporting to improve efficiency (Li et al 2023; Ghosh et al 2024; Ryu et al 2025). Despite these advances, few studies report VLM performance for dental image understanding. A concurrent study established a benchmark for panoramic radiograph analysis and found no significant difference between general-purpose and medical-specific VLMs in dentistry (Hao et al 2025). Another study collected a large-scale mixture of X-ray and intraoral images and fine-tuned a VLM on data derived mainly from categorical labels (Meng et al 2025). While the results look promising, neither the dataset nor the model is publicly available.
Based on our preliminary observations, even state-of-the-art VLMs struggle with a fine-grained understanding of intraoral images. We attribute this limitation to 2 interrelated challenges:
Dental diagnosis often requires nuanced, hierarchical interpretations that go beyond closed-set classification. Current VLMs, trained primarily on categorical labels or vague text descriptions, lack the capacity to reason over such fine-grained and clinically contextual features.
The development and evaluation of VLMs for dentistry are severely constrained by the scarcity of well-annotated, diverse, and publicly available datasets. Most existing datasets for intraoral images focus on specific categories (Dot et al 2024; Nguyen et al 2025; Wang et al 2025) and lack diversity across different data sources (Huang et al 2024; Uribe et al 2024).
To address the above limitations, we curated a dedicated dataset and benchmarked state-of-the-art VLMs, introducing innovations in 2 key aspects: annotation strategy and data sourcing. First, we proposed a novel semistructured labeling strategy that formulates the annotation as an open-set weak labeling task for abnormality detection. This approach provides a compact yet comprehensive representation of each image, which can be reliably translated into various downstream task-specific formats. Second, we constructed a large-scale image dataset with substantial diversity by filtering and curating images from web-scraped data, and then we labeled a subset with the proposed semistructured scheme to enable both precise annotation of abnormalities and holistic scene understanding.
In this work, we introduce MetaDent, a semistructured annotation framework and large-scale resource for vision-language understanding of intraoral images. MetaDent is designed to support diverse downstream tasks—including visual question answering, multilabel classification, and image captioning. The name “MetaDent” reflects its meta-annotation structure, broad data diversity, and adaptability across a wide range of dental artificial intelligence (AI) applications.
To encourage reproducibility and further research, we publicly release the dataset, annotation interface, and benchmarking tools at https://menxli.github.io/metadent/.
Materials and Methods
Ethics Statement
The study protocol was approved by the institutional ethics committee of the Hospital of Stomatology, Wuhan University (No. WDKQ2025[C02]). The study adhered to the ethical considerations outlined by the committee, including participant privacy and data protection measures.
Meta Dataset Curation
We collected images from 3 sources: in-house clinical photographs, a public dataset, and web-crawled images. Specifically, we collected 4,373 clinical photographs from the Department of Prosthodontics at the School of Stomatology, Wuhan University (Data Source 1, DS1). For the public component, we included 9,390 images from the Teeth or Dental image dataset (Data Source 2, DS2) (Chaudhary et al 2024), given its relatively large quantity and good quality. Lastly, we filtered COYO-700M (Byeon et al 2022), a large-scale web-crawled image corpus, using a fine-tuned ViT-L/16 binary classifier to retain only dental images and an image hasher (Haviana and Kurniadi 2016) for duplicate removal (details in Appendix Section 1). This step resulted in 46,906 images (Data Source 3, DS3). In total, the collected dataset comprises 60,669 dental images.
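The filtering stage can be sketched as follows. This is a minimal illustration, assuming the fine-tuned classifier is wrapped as a hypothetical `is_dental_image` predicate and using the `imagehash` package for perceptual-hash deduplication; the exact pipeline is described in Appendix Section 1.

```python
# Minimal sketch of the web-image filtering stage. Assumptions: the
# fine-tuned ViT-L/16 classifier is wrapped as `is_dental_image`
# (hypothetical helper), and perceptual hashing via the `imagehash`
# package stands in for the duplicate-removal step.
from pathlib import Path

from PIL import Image
import imagehash


def is_dental_image(img: Image.Image) -> bool:
    """Placeholder for the fine-tuned ViT-L/16 binary classifier."""
    raise NotImplementedError


def filter_and_deduplicate(image_dir: str) -> list[Path]:
    kept, seen_hashes = [], set()
    for path in sorted(Path(image_dir).glob("*.jpg")):
        img = Image.open(path).convert("RGB")
        if not is_dental_image(img):  # keep only dental images
            continue
        h = imagehash.phash(img)  # 64-bit perceptual hash
        if h in seen_hashes:  # drop near-identical duplicates
            continue
        seen_hashes.add(h)
        kept.append(path)
    return kept
```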
From this combined image dataset, we randomly sampled a subset of 3,576 images for human review. We excluded images lacking clinical relevance (e.g., artificial images) or sufficient quality (e.g., blurry images), resulting in a final set of 2,588 images for annotation. The annotation follows a semistructured format, where each image is assigned an overall descriptive summary and a list of identified abnormalities (Fig. 1). The overall description is a concise paragraph that introduces the main content of the image as well as the shooting perspective. Abnormalities were defined as any clinically relevant deviations from normal dental anatomy or healthy tissue appearance. Annotators were instructed to list the abnormalities point by point using natural, unstructured language focused on visual appearance, including diagnostic interpretations where applicable. Meanwhile, the annotators also drew a rough contour in the image for each abnormality's corresponding entry.

Figure 1. Data processing pipeline.
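To illustrate, a single meta-label record under this scheme could be serialized as follows; the field names and contents here are hypothetical examples, not the released schema.

```python
# Hypothetical serialization of one meta-label record; field names and
# contents are illustrative only, not the released schema.
record = {
    "image_id": "ds3_000123",
    "summary": "Frontal intraoral view of the anterior teeth in occlusion.",
    "abnormalities": [
        {
            # Cautious wording is used for moderate-confidence findings
            "description": "Chalky white patch on the labial surface of the "
                           "upper left central incisor, possibly early caries.",
            "contour": [[412, 188], [430, 192], [428, 215], [409, 211]],
            "uncertain": False,
        },
        {
            "description": "Mild crowding of the lower anterior teeth.",
            "contour": [[240, 402], [520, 398], [518, 470], [238, 466]],
            "uncertain": False,
        },
    ],
}
```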
All images were annotated by 2 dentists with 7 and 4 y of clinical experience, respectively. Prior to formal annotation, the annotators underwent 1 mo of training, during which they received guidance from a senior dentist with 10 y of clinical experience. Training sessions included trial annotation and corrective feedback. During the formal annotation, any ambiguous or challenging cases were discussed among all 3 experts. When the visual evidence was inconclusive but the annotators had moderate confidence, descriptions were used with cautious wording. When confidence was very low, the entry was flagged as “uncertain.” To assess consistency between the raters, interrater reliability was evaluated: Cohen’s κ coefficient, calculated on 100 images randomly drawn from the dataset, was 0.83 (Appendix Section 10), indicating a high level of agreement. To support exhaustive labeling of abnormalities, the 2 annotators cross-verified each other’s labels after the initial annotation phase, so that every image was reviewed by at least 2 raters. The initial annotations were in Chinese and translated into English via large language models (LLMs); all LLM-assisted steps in the study were carried out using GPT-OSS-120B (OpenAI et al 2025). Of the 2,588 images processed, the labeling effort yielded 6,314 confidently annotated records and 138 entries flagged as uncertain.
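For reference, this agreement statistic can be reproduced in a few lines; the sketch below uses scikit-learn, and the per-image rater decisions shown are placeholders.

```python
# Sketch: Cohen's kappa over paired rater decisions (placeholder data;
# the study computed kappa on 100 randomly drawn images).
from sklearn.metrics import cohen_kappa_score

# Binary presence/absence judgments by the 2 raters on the same images.
rater_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
rater_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

print(f"Cohen's kappa = {cohen_kappa_score(rater_a, rater_b):.2f}")
```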
Secondary Dataset Generation
Following the meta-dataset labeling, we converted the semistructured data into standardized formats with the help of an LLM. In particular, we chose 3 tasks: visual question answering (VQA), multilabel classification, and image captioning. This process effectively expands the dataset's size and applicability for diverse downstream tasks. The integration of an LLM is based on 2 motivations: 1) our semistructured label is, in essence, a compact and complete representation of the intraoral image, which makes the deduction of unmentioned normal structures possible, and 2) current VLMs still face limitations in reliability, particularly in specialized domains (Jeong et al 2024; Nath et al 2025). In contrast, LLMs reason more reliably and hallucinate less when given an appropriate context (Li et al 2024). We therefore harnessed linguistic reasoning to achieve more robust and scalable data processing.
For the VQA generation, we allowed the model to reason about common abnormalities not mentioned in the labels and to pose questions beyond the explicit annotations. We generated 5 to 10 questions per image depending on the number of labeled entries: when fewer than 2 abnormalities were labeled in an image, we generated 5 questions; otherwise, 10. The question types included judgment (true/false) and multiple-choice. To enhance the quality of the generated VQA pairs, a self-refinement step was applied (Madaan et al 2023). Entries marked as “uncertain” were excluded from downstream analyses to minimize ambiguity (Appendix Section 5).
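The question budget and generation loop can be sketched as follows; the `llm` callable and the prompt wording are hypothetical stand-ins for the actual GPT-OSS-120B calls and prompts (Appendix Section 5).

```python
# Sketch of the per-image question budget and generation loop; `llm` and
# the prompt wording are hypothetical stand-ins for the actual calls.
def question_budget(n_abnormalities: int) -> int:
    """5 questions when fewer than 2 abnormalities are labeled, else 10."""
    return 5 if n_abnormalities < 2 else 10


def generate_vqa(record: dict, llm) -> str:
    n = question_budget(len(record["abnormalities"]))
    prompt = (
        f"Given this annotation: {record}\n"
        f"Write {n} questions (true/false and multiple-choice) with answers "
        "and a brief explanation each; you may also ask about common "
        "abnormalities absent from the labels."
    )
    draft = llm(prompt)
    # Self-refinement pass (Madaan et al 2023): critique and revise.
    return llm(f"Review and improve these question-answer pairs:\n{draft}")
```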
For the classification, we established 18 classes based on visual appearance, with clinical pathology serving as a guiding reference (Fig. 4C and Appendix Table 1). For example, “chalky patches” were grouped into “tooth color abnormality,” regardless of whether they stemmed from early caries, enamel hypoplasia, fluorosis, or postorthodontic lesions. Likewise, plaque and calculus, which are sometimes visually indistinguishable in photographs, were merged into 1 class. This image-centric approach balances practicality and medical relevance and is presumably more effective than strict clinical diagnosis from the perspective of image analysis.
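Conceptually, the grouping of free-text findings into appearance-based classes resembles the keyword mapping sketched below; in practice the assignment was performed by the LLM, and apart from the classes named above, the keyword and class strings here are hypothetical.

```python
# Illustrative keyword-to-class fragment. The real assignment was
# LLM-driven; apart from the classes named in the text, the class and
# keyword strings below are hypothetical.
KEYWORD_TO_CLASS = {
    "chalky patch": "tooth color abnormality",
    "white spot": "tooth color abnormality",
    "crowding": "malocclusion or dental malalignment",
    "plaque": "plaque or calculus",  # visually merged into 1 class
    "calculus": "plaque or calculus",
    "retained root": "residual root",
}


def classes_for(description: str) -> set[str]:
    text = description.lower()
    return {cls for kw, cls in KEYWORD_TO_CLASS.items() if kw in text}
```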
For the image captioning, the LLM was prompted to generate free-form descriptions from the meta-labels, which were used as reference captions.
To ensure data quality and to better understand the sources of errors, we analyzed errors arising during the transition, focusing on VQA and multilabel classification. During dataset generation, the LLM was prompted to provide explanations for its answers to support human review. Two annotators evaluated the entire classification dataset and randomly sampled 2 VQA pairs per image to analyze errors and make necessary revisions or deletions. We identified and defined 6 common error types; detailed descriptions of each are provided in Appendix Table 2.
The above steps resulted in 18,416 VQA pairs and 2,588 multilabel classification questions for subsequent assessment. Prompts for generation are supplied in Appendix Section 5.
Dataset Summary
Figure 2 characterizes MetaDent's visual properties. Most images span 10⁵ to 10⁶ pixels (width: ~200–1,000 px), with aspect ratios varying by source: DS3 shows the broadest spread, and DS2 is the most uniform (Fig. 2A, B). Analysis by Ovis (Lu et al 2024) reveals that 80% of images depict humans; among these, eyes and nose are detected for fine-grained categorization (Fig. 2C). Intraoral views comprise 63.5% of the dataset and facial images 5.0% (deidentified upon release). In the labeled subset, most images contain 1 to 3 annotated abnormalities (Fig. 2D). In the Table, we compare MetaDent against existing intraoral image datasets. The key gap that MetaDent uniquely fills is not merely dataset size but the combination of large-scale coverage, semistructured meta-annotation, and broad task generality within a single unified framework.

Figure 2. Dataset statistics.
Table. Comparison between MetaDent and Other Representative Publicly Available Dental Image Datasets, Including Publication Year, Size, Annotation Type, Image Source, Scope, and Whether Each Dataset Supports Classification (CLS), Image Captioning (CAP), and Segmentation (SEG) Tasks.
Evaluation of the VLMs
We evaluate VQA performance using accuracy, defined as the proportion of questions answered correctly. This evaluation is performed separately for multiple-choice and true/false question types. For multilabel classification, we report precision, recall, and F1-score, along with Exact Match, defined as the proportion of samples whose predicted labels exactly match the ground truth. The F1-score, defined as the harmonic mean of precision and recall, reflects the balance between false positives and false negatives. In image captioning, we evaluate the generated captions using an LLM-as-a-judge from semantic and diagnostic consistency perspectives. Semantically, we prompt the LLM to generate a reference caption and compare it with the outputs from the VLMs using BERTScore (Zhang et al 2020), which computes token-level contextual embeddings and measures how well the generated caption semantically aligns with the reference description. At the diagnostic consistency level, we instructed the LLM to evaluate whether the generated caption correctly identifies the key diagnostic findings and anatomical location of abnormalities, as specified in the meta-label. This essentially transforms the evaluation into an open-set multilabel classification task. Metrics were computed both per data source and over the full dataset. Detailed definitions of each metric are provided in Appendix Section 2.
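For concreteness, the classification metrics can be computed over binary indicator matrices as sketched below; micro-averaging is our assumption here, as the exact averaging scheme is specified in Appendix Section 2.

```python
# Sketch: multilabel classification metrics over binary indicator
# matrices (micro-averaging assumed; see Appendix Section 2 for the
# exact definitions used in the paper).
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([[1, 0, 1], [0, 1, 0]])  # ground-truth label vectors
y_pred = np.array([[1, 0, 0], [0, 1, 0]])  # model predictions

p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)
exact_match = (y_true == y_pred).all(axis=1).mean()  # all labels correct
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f} ExactMatch={exact_match:.3f}")
```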
The VLMs included in our study represent the current frontiers in vision-language understanding, including proprietary models: GPT-4o-2024-08-06 (OpenAI et al 2024) and Gemini-2.5-flash (Comanici et al 2025), as well as leading open-source models: Qwen3-VL-235B-A22B-Thinking (Yang et al 2025), Ovis2-34B (Lu et al 2024), and Baichuan-Omni-1.5 (Li et al 2025). Several models occasionally produced nonresponses or format-inconsistent outputs; these cases were excluded from analysis, and the exclusion rate is reported in Appendix Table 13. To ensure VLMs did not benefit systematically from sharing linguistic priors with the LLM used in benchmark construction, we repeated the evaluation using an alternative LLM—results showed no noticeable bias (Appendix Section 9).
Results
Datasets
In Figure 1A, image features were extracted using DINOv3 (Siméoni et al 2025) and subsequently reduced in dimensionality using principal component analysis (PCA). In the plot, labeled data points are represented by darker, larger dots, while unlabeled raw images appear as lighter, smaller dots. The visualization reveals that the Internet-scraped images (DS3) dominate the dataset and occupy a broader region, indicating that they introduce greater visual diversity and cover a wider range of scenarios. Moreover, the labeled data were sampled uniformly from the entire distribution, indicating that the annotated subset is representative of the overall dataset without significant sampling bias.
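This visualization follows a standard recipe, sketched below under the assumption that the DINOv3 features and the labeled-subset mask have already been saved to disk; the file names and plot styling are illustrative.

```python
# Sketch: 2-D PCA of precomputed image features. Feature extraction with
# DINOv3 is omitted; the file names and plot styling are illustrative.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

features = np.load("dinov3_features.npy")  # (n_images, feat_dim), assumed
is_labeled = np.load("is_labeled.npy")     # boolean mask, assumed

xy = PCA(n_components=2).fit_transform(features)
plt.scatter(*xy[~is_labeled].T, s=4, alpha=0.3, label="unlabeled")
plt.scatter(*xy[is_labeled].T, s=12, alpha=0.9, label="labeled")
plt.legend()
plt.show()
```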
To assess benchmark quality, we conducted a human review of the VQA subset (5,176 randomly sampled question-answer pairs from 18,416 total) and the entire classification dataset (2,588 images). Annotators rated 94.2% (4,875) of VQA pairs as correct; for classification, 90.4% of images (2,340) had perfectly correct labels, while 248 (9.6%) contained 1 or more errors. This evaluation confirms the high overall fidelity of the LLM-generated datasets derived from the meta-labels. The error analysis for VQA and classification dataset generation is illustrated in Appendix Figure 4; the incorrect responses spanned a mix of the error types.
Figure 4B shows the manual classification of the randomly sampled labeled subset, reflecting overall pathology prevalence: “malocclusion or dental malalignment” is most common, while “residual root” and “oral ulcer” are less frequent.
VQA
All evaluated VLMs achieve only moderate accuracy in VQA (Fig. 3A, B). True/false questions (TFQ) consistently receive higher accuracy than multiple-choice questions (MCQ). Gemini-2.5-Flash led performance (64.1% MCQ, 67.9% TFQ), slightly ahead of GPT-4o (60.6% MCQ, 67.4% TFQ). Open-source models generally lagged behind: Ovis-2 (61.7% MCQ, 67.3% TFQ), Qwen3-VL (60.4% MCQ, 63.9% TFQ), and Baichuan-Omni (57.5% MCQ, 64.5% TFQ). Performance varied by image source: all models performed slightly better on DS3. Nevertheless, results remained suboptimal, as no model surpassed 68% accuracy on any question type, highlighting the difficulty of fine-grained intraoral VQA and the modest advantage of proprietary over open-source models.

Figure 3. Performance of 5 vision-language models on MetaDent across 3 tasks.
Classification
Multilabel classification was challenging for all models (Figs. 3C, D and 4). GPT-4o achieved the highest F1-score (0.437) with balanced precision and recall (0.475/0.438), outperforming Gemini-2.5 (F1 ≈ 0.394, with higher recall [0.444] but lower precision [0.398]). Open-source models scored lower (F1 ≈ 0.30–0.33). Exact Match accuracy was very low (GPT-4o 24.5%; others <16%), showing that models rarely predicted all labels correctly. Overall, even the best model detected less than half of all findings.

Figure 4. Category-level performance for the multilabel classification task.
Image Captioning
Models struggled to produce accurate free-form captions (Fig. 3E, F). Gemini-2.5 achieved the highest semantic similarity to the references (BERTScore-F1 ≈ 0.209), followed by Qwen3-VL (0.206), GPT-4o (0.203), Ovis-2 (0.155), and Baichuan-Omni (0.129). Appendix Section 9 reveals slightly different styles: GPT-4o (P ≈ 0.112, R ≈ 0.072) and Ovis (P ≈ 0.108, R ≈ 0.084) favored precise captions at the cost of sensitivity (low recall), while the others were more balanced. Diagnostic consistency remained low across all models (F1 ≈ 13% at best), indicating that most key findings were missed or incorrectly interpreted. In summary, captions generated by VLMs are often unreliable for clinical interpretation.
Discussion
In this study, we focus on the limitations of VLMs for intraoral image understanding and propose a semistructured meta-labeling technique to support the analysis of clinical images. Unlike conventional categorical or textual labels, our approach generates rich, hierarchical representations that capture semantic meaning at arbitrary levels of detail, enabling effective transfer to downstream tasks. To support this work, we collected a diverse dataset of dental images primarily from online sources and carefully annotated a subset to evaluate representative VLMs. Our results show that state-of-the-art models struggled across all tasks, reaching at most about 68% accuracy in VQA and around 0.4 F1 in multilabel classification, while their image captions often missed important findings. We further observe a sharp performance degradation on DS2, suggesting the presence of a domain shift. A more detailed analysis of the challenges associated with DS2 indicates that the performance drop may be attributed to shifts in image tone and demographic differences between datasets (detailed analysis in Appendix Section 3). These findings highlight that current VLMs, even cutting-edge systems such as GPT-4o, are not yet reliable for deployment in clinical settings without further refinement.
In contrast to prior studies that typically rely on photographs from a single institution or a limited number of clinical centers, we adopted a different approach by primarily utilizing web-scraped images. Specifically, we used a filtered subset of the COYO-700M dataset, which was originally sourced from Common Crawl (2008), a large-scale web archive. This collection contains more diverse photographs than single-center or single-source public datasets, as shown in Figure 1A. This diversity in imaging conditions, lighting, and patient demographics enhances the generalizability of the dataset, reducing bias and enabling better real-world applicability.
Another contribution of this study lies in the labeling protocol. The advantages of the proposed technique are as follows:
It effectively captures the hierarchical complexity of intraoral conditions that conventional categorical labels cannot represent. For example, a full crown may vary in material, color, and defects, which cannot be documented comprehensively with a single category and can be described simultaneously within our structure (Appendix Section 7). This enables richer, more precise supervision signals for VLMs.
Compared with free-form captions commonly used in general-domain VLMs, the proposed scheme offers a more efficient and clinically meaningful annotation process. By focusing on abnormalities rather than exhaustive descriptions, it reduces redundancy. Normal conditions can be inferred from the absence of abnormalities, minimizing annotation workload without compromising interpretability. Moreover, the point-by-point label structure also facilitates the attachment of metadata (e.g., bounding boxes, segmentations) for downstream applications.
By balancing comprehensiveness with conciseness, the proposed method provides a compact yet complete annotation of each image: a brief summary of the main visual content and a structured list of free-form abnormality descriptions. By leveraging the strong reasoning capabilities and dental knowledge of LLMs, we can easily scale the dataset to different formats for downstream tasks. VLMs trained on such data could not only assist in diagnosis to reduce errors but also enable applications such as smart health care management, early-stage oral disease screening, and self-conducted oral health checks at home. Some of the potential use cases are listed in Appendix Section 8.
The primary limitation of this work lies in the relatively small size of the labeled dataset, which is insufficient to claim broad coverage of intraoral conditions and can hardly support fine-tuning of large, domain-specific VLMs. Additionally, most of the images are web-scraped, introducing variability in quality and uncertain provenance. The labels combine human annotations with LLM-generated outputs, which may introduce noise. That said, LLM-assisted data generation and quality control are common in related research, and we quantitatively assessed error sources to support a certain level of confidence in the dataset's overall reliability. As large models continue to advance, these techniques are expected to yield even higher-quality data. Lastly, while our benchmark is large in scale and of reasonable quality, we acknowledge the absence of a direct comparison to human expert performance.
Despite these constraints, our benchmark offers a robust and challenging evaluation platform for multimodal models in dentistry. We release the dataset, an initial set of annotations, and the labeling tools, hoping this work will encourage community-driven expansion, support fine-tuning and regulatory validation, and ultimately aid the development of more robust and practically useful models in oral health.
Conclusion
In this work, we proposed a semistructured annotation framework for intraoral image analysis that enables rich, scalable, and task-agnostic representations. By curating a diverse dataset and standardized benchmarks, we evaluated state-of-the-art vision-language models and revealed their limitations in fine-grained dental understanding: the gap between current VLM capabilities and dental requirements remains wide. Closing this gap requires concerted, cross-disciplinary collaboration between artificial intelligence (AI) researchers and dental practitioners. By providing the community with a roadmap and tools, we aim to foster such interdisciplinary synergy and advance multimodal AI in dentistry, ultimately moving us closer to practically useful systems in oral health care.
Author Contributions
M.-X. Li, W.-H. Deng, contributed to conception and design, data acquisition, analysis, and interpretation, drafted and critically revised the manuscript; Z.-X. Wu, C.-X. Jin, contributed to data acquisition, analysis, and interpretation, critically revised the manuscript; J.-M. Wu, J. K. H. Tsoi, contributed to conception and design, critically revised the manuscript; Y. Han, contributed to data analysis, drafted and critically revised the manuscript; G.-S. Xia, C. Huang, contributed to conception and design, drafted and critically revised the manuscript. All authors gave final approval and agree to be accountable for all aspects of the work.
Supplemental Material
Supplemental material, sj-docx-1-jdr-10.1177_00220345261424242, for “MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry” by M.-X. Li, W.-H. Deng, Z.-X. Wu, C.-X. Jin, J.-M. Wu, Y. Han, J. K. H. Tsoi, G.-S. Xia, and C. Huang in Journal of Dental Research is available online.
Acknowledgements
The authors gratefully acknowledge Chao Pang for helpful discussions and Zhong-Shi Zhang, Yu-Jie Wu, and Mu-Qi Jiang for their valuable contributions to data processing and their unwavering support throughout this work.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was funded by the National Natural Science Foundation of China (82401200), Natural Science Foundation of Hubei Province (2024AFB033), Key R&D Program of Hubei Provincial Department of Science and Technology (2023BAB058), and National College Students Innovation and Entrepreneurship Training Program (202510486171).
Data Availability
The dataset, annotations, and labeling tools are publicly available at https://menxli.github.io/metadent/. A supplemental appendix to this article is available online.
