Sage Journals: Discover world-class research

Abstract

Objective

Poor conditions in the intraoral environment often lead to low-quality photos and videos, hindering further clinical diagnosis. To restore these digital records, this study proposes a real-time interactive restoration system using the segment anything model.

Methods

Intraoral digital videos, obtained from the vident-lab dataset through an intraoral camera, serve as the input for the interactive restoration system. The initial phase employs an interactive segmentation module leveraging the segment anything model. Subsequently, a real-time intraframe restoration module and a video enhancement module were designed. A series of ablation studies were systematically conducted to illustrate the advantages of the system design. Our quantitative evaluation criteria comprise restoration quality, segmentation accuracy, and processing speed. Furthermore, the clinical applicability of the processed videos was evaluated by experts.

Results

Extensive experiments demonstrated the system's segmentation performance, with a mean intersection-over-union of 0.977. On video restoration, it achieved a peak signal-to-noise ratio of 37.09 and a structural similarity index measure of 0.961. More visualization results are shown on the project page (https://yogurtsam.github.io/iveproject).

Conclusion

The interactive restoration system demonstrates its potential to serve patients and dentists with reliable and controllable intraoral video restoration.

Keywords

Biomedical image processing, digital dentistry, deep learning, intraoral digital videos, segment anything model, video restoration

Introduction

In response to the growing demand for dental healthcare, the application of computer-aided design systems and digital dentistry is gaining significant traction.^1–5 Intraoral digital photos and videos have been increasingly used in clinical practice as auxiliary and informative methods for documenting the progress of a patient’s treatment and recording their clinical condition.^6,7 They can help assess various aspects, such as the extent of caries,^8,9 tooth wear rates,¹⁰ restorations,^6,11,12 staining,¹³ and demineralization.¹⁴ Compared with photos, intraoral videos offer several advantages. Videos provide real-time visualization, allowing dentists to observe and assess live images during dental examinations or procedures. They can also capture dynamic movements within the oral cavity, allowing dentists to assess occlusion (bite) and jaw movements during functions such as chewing or speaking. This information is essential for orthodontic and prosthodontic treatments. For patients, seeing live images of their oral health conditions can build trust and confidence in the dentist’s expertise, fostering a positive patient-dentist relationship. In addition, these videos enable remote dental examination and cross-checking with specialist diagnostic support, that is, “store and forward” telemedicine.

In contemporary clinical practice, dentists use various instruments to obtain intraoral digital photos and videos, including single-lens reflex (SLR) cameras,¹⁵ dental operating microscopes (DOM),¹⁶ intraoral cameras (IOC),¹⁷ and even smartphones.¹⁸ While both SLR and DOM are recognized as the most standardized methods in dental practice, their elevated costs, labor-intensive nature, and technical intricacies may constrain their application. For example, when employing an SLR to take a photo of the patient’s occlusal surface, the involvement of an assistant is essential. This collaboration entails the use of instruments such as mouth openers and reflectors, as the operator alone is unable to manage all aspects of the process.¹⁵ Using a DOM without professional training and adequate practice may hinder results and burden treatment.¹⁹ Meanwhile, both of these devices necessitate patient cooperation and a considerable degree of mouth opening. Under these conditions, integrating an IOC with an intraoral instrument, such as a dental handpiece, enables a dental practitioner to monitor the procedure continuously. The IOC makes recording easier in clinical consultations where conditions are less than ideal.²⁰ IOC functionalities have expanded to encompass a diverse array of features, including macro mode for magnification, a curing light for composites, light-emitting diode lights, capabilities for picture or video recording, and fluorescence for detecting various stages of caries, plaque, and gingival inflammation.²¹

However, in clinical dentistry, the intraoral environment introduces additional complexities, such as lip shadow, specular reflection, non-centered teeth, and tremulous images.^18,22 These challenges hinder the wider use of photography in dental diagnosis. For instance, tremulous images may introduce ghosting artifacts. In addition, unpredictable variations in illumination, combined with the interference of fluids like water and saliva, make it difficult to preserve image fidelity.^18,23 These unintended distortions influence the interpretation of the resulting images, potentially affecting diagnostic precision and therapeutic decision-making. Compared with static photos, intraoral videos capture a broader range of details and scenes throughout the treatment process. Nonetheless, the quality of videos is more susceptible to the aforementioned factors than that of static photos. These challenges emphasize the need to restore and enhance the captured videos to ensure their reliability and clinical utility.

To address the above issues, the utilization of video restoration and enhancement techniques emerges as a pivotal strategy. Previous approaches employ deep neural networks to restore the video frames automatically.^23–25 Though straightforward, they encounter challenges in accurately identifying and restoring significant tooth regions amidst complex backgrounds.^26,27 These automatic methods^28,29 have not exhibited sufficiently accurate and robust results for clinical use, primarily due to the inherent complexities in intraoral scenarios. Consequently, expert intervention is frequently required for post-hoc correction. In practical applications, the absence of human interaction makes it unfeasible to reliably control the results of video restoration. In addition, they need to solve multi-task optimization problems, which hampers their effectiveness in real-time applications within digital dental workflows.

The aim of this study is to investigate the reliable restoration and enhancement of intraoral digital videos through an interactive and real-time system. The proposed system integrates interactive segmentation as a preliminary step, leveraging the advancements of the recent segment anything model (SAM).³⁰ SAM, as the pioneering foundational model for general image segmentation, demonstrates robust capabilities in generating accurate object masks through interactive prompts provided by users (e.g. bounding boxes and click points). The entire system comprises three modules: an interactive segmentation module utilizing SAM, an intraframe restoration module, and an interframe enhancement module. This system can achieve real-time restoration of specific regions through interactive prompts from users (i.e. doctor-in-the-loop^31,32). Furthermore, to attain additional video enhancement, we delve into the potential of artificial intelligence generated content (AIGC)^33–35 for clinical applications in the interframe enhancement module. A series of video super-resolution^36,37 and video interpolation^38,39 approaches are further employed.¹ Finally, both video frame restoration quality and real-time performance are evaluated. The proposed system is flexible and can seamlessly integrate more advanced SAM and AIGC approaches in future iterations.² Extensive experiments demonstrate that it achieves reliable performance on intraoral video restoration. The project page has been released to further support and promote the research efforts.⁴⁰

Methods

Overview

The proposed system takes low-quality videos captured by an intraoral camera as input. The system then performs interactive video restoration guided by user prompts. The system is evaluated using the vident-lab dataset,⁴¹ which provides paired low-quality and high-quality videos depicting the same dental scene. To illustrate the advantages of the framework design, ablation studies are conducted on each sub-module. We compare this system with previous approaches in several respects, including quantitative analyses of restoration, segmentation, and processing speed, as well as a qualitative evaluation of clinical applicability. More details are given in the following sections.

Data acquisition

To obtain the paired low-quality and high-quality intraoral videos for training, a micro-camera and a high-definition camera with larger sensors and optics are tightly coupled through a 50/50 beam splitter. This configuration allows for the simultaneous acquisition of both low-quality and high-quality frames. The vident-lab dataset⁴¹ is proposed for multi-task intraoral video inference in a clinical scenario, encompassing the tasks of restoration and tooth segmentation. The dataset consists of training, validation, and test sets, comprising 300, 29, and 80 videos, respectively. These videos correspond to a total of 60K frames with 300 segmentation masks for the training set, 5.6K frames with 116 masks for the validation set, and 15.5K frames with 320 masks for the test set.²³

Framework of interactive restoration system (IARS)

The proposed framework contains an interactive segmentation module using SAM, an intraframe restoration module, and an interframe enhancement module. Figure 1 illustrates the whole system. The input comprises low-quality videos of the vident-lab dataset for training, while the corresponding high-quality videos are regarded as the ground truth. Figure 2 shows the system design and architecture.

Figure 1.

Workflow of the proposed system. The system takes intraoral video frames as input and produces the enhanced videos. Following interactive segmentation and intraframe restoration, “Human Improvement” is executed iteratively on intraoral videos through prompts (doctor-in-the-loop) until “Expert Evaluation” attains high quality.

Figure 2.

System design and architecture. (a) The architecture of the interactive segmentation module. (b) The architecture of intraframe restoration module. “Fusion” denotes the concatenation between region-level and entire frame features from intermediate layers. “Conv” means the convolutional layer. (c) The architecture of the interframe enhancement module. The input is restored video frames with low frame rate (T) and resolution from (b). The output is enhanced videos with higher frame rate (T+n) and resolution. “t” means time step. “LR” denotes low resolution. “HR” denotes high resolution.

Interactive segmentation module

Firstly, the low-quality video is divided into frame images with a resolution of $256 \times 256$. These images are fed into the interactive segmentation module. Once the user provides bounding boxes or click points, segmentation results for the foreground (teeth) and background (oral mucosa, etc.) are generated interactively. Based on the segmentation results, the regions of teeth and background are cropped separately. These cropped region-level images are then resized to $256 \times 256$ and used as input, along with the entire frame, for the intraframe restoration module.
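The cropping and resizing step can be illustrated with a minimal sketch. It assumes that the segmentation result is a 2D label mask (1 for teeth, 0 for background), that pixels outside the selected region are blanked with a fill value, and that resizing uses nearest-neighbour sampling. None of these details are specified in the text, and all function names are hypothetical.

```python
def bounding_box(mask, value):
    """Return (top, left, bottom, right) of pixels equal to `value`, or None."""
    rows = [r for r, row in enumerate(mask) if value in row]
    cols = [c for row in mask for c, v in enumerate(row) if v == value]
    if not rows:
        return None
    return min(rows), min(cols), max(rows) + 1, max(cols) + 1


def crop_region(frame, mask, value, fill=0):
    """Crop the bounding box of one region, blanking pixels outside it (assumption)."""
    box = bounding_box(mask, value)
    if box is None:
        return []
    top, left, bottom, right = box
    return [
        [frame[r][c] if mask[r][c] == value else fill for c in range(left, right)]
        for r in range(top, bottom)
    ]


def resize_nearest(image, size):
    """Nearest-neighbour resize of a non-empty 2D list to size x size."""
    h, w = len(image), len(image[0])
    return [
        [image[r * h // size][c * w // size] for c in range(size)]
        for r in range(size)
    ]


# Toy 4 x 4 frame; the system described above works on 256 x 256 frames.
frame = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
mask = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
teeth = resize_nearest(crop_region(frame, mask, 1), 4)
background = resize_nearest(crop_region(frame, mask, 0), 4)
```

In this toy case the 2 x 2 tooth crop is stretched back to the frame size, so the region-level branch receives the selected region at full input resolution.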

Intraframe restoration module

Secondly, the intraframe restoration module contains two branches of restoration networks. One focuses on restoring the region-level images, and the other is dedicated to restoring the entire frame. The two networks share the same architecture but possess independent model weights.²⁵ The region images and the entire frame are fed into the region-level restoration network and the entire frame restoration network, respectively. Smooth L1 loss is minimized to reduce the disparity between the restored low-quality frames and their high-quality counterparts.⁴² In addition, a fusion strategy using channel concatenation is employed to combine the region features and entire frame features extracted from the intermediate layers. In this way, the framework is able to focus on the restoration details based on user prompts. With a focus on clinical usability, this interactive process facilitates continuous improvement through iterative input prompts, thereby enhancing the overall performance and reliability of the system.
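As a rough illustration of the two ingredients named above, the following sketch computes a Smooth L1 loss and a channel-concatenation fusion on plain Python lists. The threshold `beta=1.0` and the mean reduction are assumptions (they follow a common convention and are not stated in the text), and the function names are hypothetical.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss over two equal-length sequences of pixel values.

    Quadratic for small errors (|d| < beta), linear for large ones.
    """
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def fuse_channels(region_features, frame_features):
    """Channel concatenation: each argument is a list of channels (2D maps)."""
    return region_features + frame_features


# Errors of 0, 0.5, and 2 contribute 0, 0.125, and 1.5 respectively.
loss = smooth_l1([0.0, 0.5, 3.0], [0.0, 0.0, 1.0])

# Two region-level channels fused with three whole-frame channels.
fused = fuse_channels([[[1]], [[2]]], [[[3]], [[4]], [[5]]])
```

The linear regime for large errors makes the loss less sensitive to outlier pixels than a squared error, which is one common reason for choosing it in restoration tasks.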

Interframe enhancement module

Thirdly, given the restored video frames, the interframe enhancement module is used to enhance the videos through video super-resolution^36,37 and video interpolation^38,39 operations. Both of these techniques are a significant part of AIGC for videos. Specifically, video super-resolution aims at recovering a high-resolution video from its low-resolution counterpart, while video interpolation increases the frame rate by generating intermediate frames between consecutive input frames. Here, this study investigates the potential of AIGC on intraoral videos using the MMagic toolbox.⁴³ The MMagic toolbox is an open-source AIGC tool equipped with multiple powerful models for image/video processing, editing, and generation. The application of this tool is straightforward and flexible, as it does not require additional annotations for training.
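To make the input and output shapes of the two operations concrete, the sketch below uses deliberately naive stand-ins: linear blending of neighbouring frames for interpolation and pixel repetition for upscaling. These are not the learned models provided by MMagic and would not reach their quality; the function names are hypothetical.

```python
def interpolate_frames(frames):
    """Insert one linearly blended frame between each consecutive pair.

    T input frames become 2T - 1 output frames (requires at least one frame).
    """
    out = []
    for prev, nxt in zip(frames, frames[1:]):
        out.append(prev)
        out.append([[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(prev, nxt)])
    out.append(frames[-1])
    return out


def upscale_nearest(frame, scale):
    """Repeat each pixel `scale` times along both axes (H x W -> sH x sW)."""
    return [[v for v in row for _ in range(scale)] for row in frame for _ in range(scale)]


video = [[[0, 2]], [[4, 6]]]                     # two 1 x 2 frames
smoother = interpolate_frames(video)             # three frames
larger = [upscale_nearest(f, 2) for f in smoother]  # each frame is now 2 x 4
```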

Finally, through the above process, the low-quality intraoral video is restored and enhanced, resulting in a clear reference for clinical purposes.

Ablation studies

Ablation studies are conducted on each sub-module to illustrate the superior design of our framework. Firstly, we compared the impact of different numbers of click points on the effect of segmentation. Then, the segmentation prior (“w/o Seg.,” step 1 in Figure 1) and intermediate layer fusion (“w/o Fusion,” Figure 2(b)) are omitted to verify the necessity of these two steps on restoration. Furthermore, the interactive segmentation is replaced with the automatic method Mask R-CNN^44,45 (“w/ Mask R-CNN”) to prove the significance of the interactive pattern.

Quantitative evaluation of restoration, segmentation, and processing speed

Following the experimental protocol of existing methods,²³ peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) are adopted to evaluate the video frame restoration quality. PSNR is defined based on the mean squared error (MSE). Given an image $I$ and its noisy counterpart $K$, MSE is defined as follows:

$$\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^{2} \tag{1}$$

where $I$ contains $m \times n$ pixels. The PSNR is defined as follows:

$$\mathrm{PSNR} = 20 \cdot \log_{10}\left(\frac{\max(I)}{\sqrt{\mathrm{MSE}}}\right) \tag{2}$$

SSIM is used for measuring the similarity between two images $I_{1}$ and $I_{2}$:

$$\mathrm{SSIM}(I_{1}, I_{2}) = l(I_{1}, I_{2})^{\alpha} \, c(I_{1}, I_{2})^{\beta} \, s(I_{1}, I_{2})^{\gamma} \tag{3}$$

where $l$ denotes the luminance, $c$ denotes the contrast, and $s$ denotes the structure. In addition, the processing speed is measured in frames-per-second (FPS) to assess real-time performance:

$$\mathrm{FPS} = \frac{N_{frame}}{T} \tag{4}$$

where $N_{frame}$ denotes the number of video frames and $T$ is the video duration. For a fair comparison, the above metrics are used to evaluate the results from the intraframe restoration module. For segmentation, mean intersection-over-union (mIoU) is used as follows:

$$\mathrm{mIoU} = \frac{1}{k} \sum_{i=1}^{k} \frac{|P \cap G|}{|P \cup G|} \tag{5}$$

where $P$ means the prediction and $G$ means the ground truth. In Table 1, we compare different approaches, that is, multi-input multi-output U-net (MU),⁴⁸ efficient spatio-temporal recurrent neural network (EN),²⁴ DeepLabv3+ (DLab),⁵² UNet++ (UN),⁴⁷ and MOST-Net (MN),²³ on the aforementioned measurements.

Table 1.

Comparison with methods on restoration, segmentation, and processing speed.

	Methods	Task	PSNR	SSIM	mIoU	FPS
#1	MU ⁴⁸	Auto-Res.	26.66	0.916	–	8.4
#2	EN ²⁴	Auto-Res.	30.72	0.943	–	68.5
#3	DLab ⁵²	Auto-Seg.	–	–	0.968	108.2
#4	UN ⁴⁷	Auto-Seg.	–	–	0.969	38.9
#5	EN+DLab	Auto-Seg.+Res.	30.73	0.943	0.967	28.6
#6	MN ²³	Auto-Seg.+Res.	31.05	0.947	0.946	19.3
#7	Automatic	Auto-Res.	35.24 ± 0.02	0.956 ± 0.01	0.970 ± 0.03	34.6
#8	Click points	Interact-Res.	36.96 ± 0.07	0.958 ± 0.04	0.974 ± 0.10	34.6
#9	Boxes	Interact-Res.	37.09 ± 0.07	0.961 ± 0.02	0.977 ± 0.06	34.6

MU: multi-input multi-output U-net⁴⁸; EN: efficient spatio-temporal recurrent neural network²⁴; DLab: DeepLabv3+⁵²; UN: UNet++⁴⁷; MN: MOST-Net²³. “Auto-Res.,” “Auto-Seg.,” and “Interact-Res.” denote automatic restoration, automatic segmentation, and interactive restoration, respectively. Three click points were used here. “±” denotes the range of results over the repeated experiments. PSNR: peak signal-to-noise ratio; SSIM: structural similarity index measure; mIoU: mean intersection-over-union; FPS: frames-per-second.
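The metrics in equations (1) to (5) can be sketched in plain Python as follows. Several choices here are assumptions rather than details from the study: images are 2D lists, the peak value is passed explicitly as `max_value`, SSIM is computed over a single global window with $α = β = γ = 1$ and the usual stabilizing constants (practical implementations average over local windows), masks are sets of pixel coordinates, and the FPS denominator is taken as the elapsed time in seconds. All function names are hypothetical.

```python
import math


def mse(ref, test):
    """Mean squared error between two equal-sized 2D images (Eq. 1)."""
    m, n = len(ref), len(ref[0])
    return sum((ref[i][j] - test[i][j]) ** 2 for i in range(m) for j in range(n)) / (m * n)


def psnr(ref, test, max_value=255.0):
    """PSNR in dB (Eq. 2); infinite when the images are identical."""
    err = mse(ref, test)
    return math.inf if err == 0 else 20 * math.log10(max_value / math.sqrt(err))


def _mean(values):
    return sum(values) / len(values)


def _cov(xs, ys):
    mx, my = _mean(xs), _mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def global_ssim(img1, img2, max_value=255.0):
    """Single-window SSIM (Eq. 3 with all exponents 1); a simplification."""
    xs = [v for row in img1 for v in row]
    ys = [v for row in img2 for v in row]
    c1, c2 = (0.01 * max_value) ** 2, (0.03 * max_value) ** 2
    mx, my = _mean(xs), _mean(ys)
    luminance = (2 * mx * my + c1) / (mx * mx + my * my + c1)
    contrast_structure = (2 * _cov(xs, ys) + c2) / (_cov(xs, xs) + _cov(ys, ys) + c2)
    return luminance * contrast_structure


def miou(preds, gts):
    """Mean IoU over k mask pairs (Eq. 5); masks are sets of (row, col) pixels."""
    ious = [len(p & g) / len(p | g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)


def fps(num_frames, seconds):
    """Frames per second (Eq. 4)."""
    return num_frames / seconds


clean = [[0.0, 0.0], [0.0, 0.0]]
noisy = [[25.5, 25.5], [25.5, 25.5]]
quality_db = psnr(clean, noisy)  # MSE = 650.25, so 255 / 25.5 = 10 and PSNR is 20 dB
overlap = miou([{(0, 0), (0, 1)}, {(1, 1)}], [{(0, 0)}, {(1, 1)}])  # (0.5 + 1) / 2
```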

Visualization and qualitative evaluation of clinical applicability

The qualitative visualization is presented in Figure 3. Three human experts (XC, XL, and YH) who have more than 10 years of clinical experience are blinded. They independently evaluate the clinical applicability of videos processed by the different systems, including the system without segmentation prior (“w/o Seg.”), the system using automatic segmentation (“w/ Mask R-CNN”), and the whole system proposed in our study (Table 2). Based on the repair quality of the tooth area, each video is rated “applicable” or “not applicable” according to whether it can be used for clinical applications. We average the evaluation results of the three experts and calculate the pass rate.

Figure 3.

Comparison of segmentation results obtained with different prompts against the ground truth. “3 or 5 points” denotes the number of point prompts. Please zoom in for the best view.

Table 2.

Clinical applicability evaluated by experts on 100 intraoral videos with 20K frames.

Ablation studies	Applicable	Not applicable	Pass rate (%)
w/o Seg.	85	15	85
w/ Mask R-CNN	91.33	8.67	91.33
Whole system	97.33	2.67	97.33

“w/o Seg.”: without segmentation prior (step 1 in Figure 1); “w/ Mask R-CNN”: using an automatic segmentation method (step 1 in Figure 1); “whole system”: using the entire workflow in Figure 1. The evaluation results of the three experts were averaged. R-CNN: region-based convolutional neural network.
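The averaged pass rate in Table 2 can be reproduced with a short calculation. The per-expert counts are not reported, so the counts below are hypothetical values chosen only so that their average matches the table; the function name is also hypothetical.

```python
def pass_rate(applicable_counts, total_videos):
    """Average the experts' 'applicable' counts and convert to a percentage."""
    mean_applicable = sum(applicable_counts) / len(applicable_counts)
    return 100.0 * mean_applicable / total_videos


# Hypothetical per-expert counts out of 100 videos (not reported in the study).
whole_system = pass_rate([97, 97, 98], 100)   # 292 / 3, about 97.33
mask_rcnn = pass_rate([91, 91, 92], 100)      # 274 / 3, about 91.33
```

A non-integer average such as 97.33 arises because the three experts do not agree on every video, so the mean of their counts need not be a whole number.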

Results

Experimental setup

For the interactive segmentation module, the intraoral frame segmentation is evaluated with three settings: click point prompt, bounding box prompt, and auto-prompt. In contrast to the fully automatic auto-prompt setting, click points and bounding boxes provide an interactive and flexible approach that allows users to incorporate their prior knowledge during the segmentation process. For the learnable intraframe restoration module, the two branches are jointly trained for 10 epochs using the Adam optimizer⁴⁹ with a learning rate of $1 \times 10^{-5}$. Training takes about 2 hours on Ubuntu with eight RTX 4090 GPUs. In the interframe enhancement module, the video super-resolution^36,37 and video interpolation^38,39 approaches provided by the MMagic v1.0.0 toolbox are adopted.
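For readers unfamiliar with the optimizer, a single Adam update for one scalar parameter is sketched below with the learning rate stated above. The remaining hyperparameters (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are the commonly used defaults and are assumptions here, since the text reports only the learning rate and the number of epochs; the function name is hypothetical.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# On the first step the update size is close to the learning rate,
# regardless of the gradient magnitude.
theta, m, v = adam_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1)
```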

Interactive segmentation module achieves promising performance on segmentation of intraoral digital videos

Figure 3 presents the segmentation results of the proposed framework for various scenarios. The visualization includes the segmentation results of two interactive prompts (click points and boxes) and a fully automatic method. It demonstrates that the interactive segmentation module based on SAM, whether employed in an interactive or automatic manner, achieves promising performance in segmentation tasks.

Segmentation prior improves the restoration effect

Figure 4 presents the restoration results and corresponding visual explanations using the class activation map from gradient-weighted class activation mapping.^50,51 Thanks to the utilization of the segmentation prior, the proposed system is capable of focusing on the tooth region, resulting in improved restoration output. More randomly chosen visualizations are included on the project page.⁴⁰

Figure 4.

Visualization of automatic restoration (“w/ Mask R-CNN”) and our interactive restoration (“Ours”). The regions that the restoration model focuses on (displayed in red) are shown in the last two columns. “Seg. Results” denotes our interactive segmentation results. “w/o Seg.” means restoration without segmentation prior. From the fourth column and the last column, there is a correlation between the segmentation results and restoration regions. R-CNN: region-based convolutional neural network.

Ablation studies verify the rationality of IARS design

Figure 5(a) demonstrates that the segmentation effect improves as the number of click points increases. In Figure 5(b) and (c), omitting the segmentation prior (“w/o Seg.,” step 1 in Figure 1) or intermediate layer fusion (“w/o Fusion,” Figure 2(b)) of the intraframe restoration module leads to inferior performance. This is primarily attributed to the incomplete restoration of certain regions in intraoral videos. Moreover, when replacing the interactive segmentation with the automatic method Mask R-CNN⁵² (“w/ Mask R-CNN”), the performance deteriorates, which demonstrates the significance of the interactive pattern.

Figure 5.

Ablation studies on the interactive segmentation module and intraframe restoration module. (a) Different numbers of click points. (b) and (c) w/o Seg.: without segmentation prior; w/o Fusion: without fusion; w/ Mask R-CNN: using an automatic method to replace the interactive segmentation. R-CNN: region-based convolutional neural network.

IARS shows significant advantages in restoration, segmentation, and processing speed

A comprehensive quantitative analysis is conducted to evaluate the video restoration quality, segmentation accuracy, and real-time processing speed. First, the proposed framework with different prompts is compared with the single-task approaches for restoration and segmentation. The framework outperforms the single-task baselines in both aspects. As shown in Table 1, IARS with the box prompt (#9) achieves the highest mIoU of 0.977, while DeepLabv3+⁵² (#3) and UNet++⁴⁷ (#4) reach 0.968 and 0.969, respectively. On video restoration, the proposed framework with different prompts (#7–9) exceeds MIMO-UNet (#1) and ESTRNN (#2) by a large margin (PSNR of 35.24 to 37.09 versus 26.66 and 30.72). This indicates that IARS with user prompts can achieve reliable performance, and that leveraging the segmentation prior further benefits video restoration. Then, a comparison is made with ESTRNN+DeepLabv3+ (#5) and MOST-Net²³ (#6), which utilizes multi-task learning for both segmentation and video restoration. IARS surpasses them across all evaluated metrics. Moreover, the reported speed of 34.6 FPS highlights the potential for real-time clinical application.
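The margins discussed above follow directly from Table 1. The sketch below recomputes a few of them and converts the reported FPS into a per-frame time budget; the dictionary layout and function names are illustrative choices, while the numbers are copied from Table 1.

```python
# Selected rows of Table 1 (values copied from the table).
table1 = {
    "MU": {"psnr": 26.66, "ssim": 0.916, "fps": 8.4},
    "EN": {"psnr": 30.72, "ssim": 0.943, "fps": 68.5},
    "MN": {"psnr": 31.05, "ssim": 0.947, "miou": 0.946, "fps": 19.3},
    "IARS-box": {"psnr": 37.09, "ssim": 0.961, "miou": 0.977, "fps": 34.6},
}


def gain(metric, ours="IARS-box", baseline="MN"):
    """Difference between two rows of Table 1 for one metric."""
    return round(table1[ours][metric] - table1[baseline][metric], 3)


def frame_time_ms(frames_per_second):
    """Average time available per frame, in milliseconds."""
    return 1000.0 / frames_per_second


psnr_gain_vs_mn = gain("psnr")                  # 6.04 dB over MOST-Net
miou_gain_vs_mn = gain("miou")                  # 0.031 over MOST-Net
psnr_gain_vs_mu = gain("psnr", baseline="MU")   # 10.43 dB over MIMO-UNet
budget = round(frame_time_ms(table1["IARS-box"]["fps"]), 1)  # 28.9 ms per frame
```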

IARS shows certain clinical applicability

The clinical applicability evaluation is detailed in Table 2. For the system without segmentation prior (“w/o Seg.”), 85% of the restored intraoral videos are deemed clinically applicable by human experts. After adding the automatic segmentation (“w/ Mask R-CNN”), the pass rate increases by 6.33 percentage points to 91.33%. With the whole system, some problematic cases can be corrected by repeating the interactive restoration workflow (Figure 1), leading to a pass rate of 97.33%.

Discussion

In this study, we proposed a real-time IARS for intraoral videos, consisting of an interactive segmentation module, an intraframe restoration module, and an interframe enhancement module. Our study demonstrated that IARS performs better than previous methods in both segmentation and restoration while remaining efficient. Moreover, the restored videos showed clinical applicability in a blind evaluation by experienced human experts. The key insights are twofold. Firstly, we introduce interactive segmentation as the initial step, which enables the precise identification of tooth and background (oral mucosa, gingiva, etc.) regions. Secondly, the system employs a powerful foundation model, SAM,³⁰ which only needs user prompts (click points or bounding boxes) to correct the segmentation results without additional multi-task optimization, thereby ensuring real-time processing. The segmentation results are passed to the subsequent restoration module and guide the accurate restoration of tooth regions.

Challenges such as low-light conditions and blur in the videos may complicate the photographic diagnosis process for dentists, thereby impeding its clinical application.^6,22,23,53 To address the issue of low-quality intraoral videos, IARS incorporates interactive segmentation prior, which enables users to intervene in the process. Effective segmentation enables the extraction of both foreground (tooth-level) and background (oral mucosa, gingiva, etc.) regions with accuracy. Previous studies^23–25 have mainly focused on automatic restoration techniques with the multi-task learning network and demonstrated their effectiveness. However, these studies have not sufficiently showcased accurate restoration for tooth regions due to the complex intraoral environment. There is a lack of explicit complementarity and guidance between these independent tasks. In contrast, the proposed system introduces interactive segmentation as the initial step, which enables the precise identification of tooth and background regions and instructs the following restoration.

To the best of our knowledge, this study represents the first attempt at interactive restoration of intraoral videos. Our results support that the segmentation prior improves video restoration, both in comparison with the single-task restoration approaches and in comparison with the system without the segmentation module. In addition, interactive segmentation performs better than the automatic methods. In this system, the user plays a crucial role in controlling the regions of video restoration and correcting any errors (i.e. doctor-in-the-loop^31,32). In short, the segmentation provides valuable guidance for video restoration (Figure 4). Improved segmentation performance directly correlates with greater gains in video restoration.

Another contribution of this study is the exploration of SAM,^34,36–39 a recent deep learning-based large model. Previous models^44,46 are often tailored to specific domains and targets, and their generalization ability is limited. As the first promptable foundation model for image segmentation, SAM is trained on an extensive dataset that encompasses an unprecedented number of images and annotations. This gives SAM considerable potential to segment object types that it has not seen, that is, strong zero-shot generalization. Extensive demonstrations have showcased the successful segmentation capabilities of SAM across diverse scenarios, and some studies have applied this model and its variants to medical images. Unlike automatic segmentation models, SAM generates segmentation results conditioned on user prompts. These prompts can be provided as a set of click points or a bounding box, which simply specify what to segment in an image. Specifically, SAM uses a vision transformer-based image encoder to compute an image embedding, while a prompt encoder embeds the prompts and integrates user interactions. The outputs of both encoders are then fed into a lightweight mask decoder to generate segmentation results. As reported, given the image embedding, the prompt encoder and mask decoder predict segmentation results from a prompt within approximately 50 ms. SAM therefore has the potential to let the proposed system segment intraoral images shortly after each user prompt, which is important for realizing real-time interactive restoration of intraoral digital videos.
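The reason prompting can be fast is that the expensive image embedding is computed once per frame and then reused for every prompt. The sketch below illustrates only this caching pattern with toy stand-in functions; the class, the three callables, and the prompt format are hypothetical and do not reproduce SAM itself. In the public `segment_anything` package, the pattern roughly corresponds to calling `SamPredictor.set_image` once and `SamPredictor.predict` for each prompt.

```python
class InteractiveSegmenter:
    """Embed the image once, then decode a mask for each user prompt."""

    def __init__(self, image_encoder, prompt_encoder, mask_decoder):
        self.image_encoder = image_encoder
        self.prompt_encoder = prompt_encoder
        self.mask_decoder = mask_decoder
        self._embedding = None

    def set_image(self, image):
        # Heavy step: run the image encoder a single time per frame.
        self._embedding = self.image_encoder(image)

    def predict(self, prompt):
        # Light step: only the prompt encoder and mask decoder run per prompt.
        if self._embedding is None:
            raise RuntimeError("call set_image before predict")
        return self.mask_decoder(self._embedding, self.prompt_encoder(prompt))


# Toy stand-ins (hypothetical) that only track how often the encoder runs.
calls = {"encoder": 0}


def toy_image_encoder(image):
    calls["encoder"] += 1
    return sum(image)


def toy_prompt_encoder(prompt):
    return prompt["kind"]


def toy_mask_decoder(embedding, prompt_embedding):
    return (embedding, prompt_embedding)


segmenter = InteractiveSegmenter(toy_image_encoder, toy_prompt_encoder, toy_mask_decoder)
segmenter.set_image([1, 2, 3])
prompts = [{"kind": "box"}, {"kind": "point"}, {"kind": "point"}]
results = [segmenter.predict(p) for p in prompts]
```

With this structure, a user who refines a result with additional clicks pays only the decoding cost for each new prompt.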

In this study, the potential of SAM in intraoral video segmentation and restoration is investigated through a series of experiments divided into three parts. Initially, SAM is evaluated in the auto-prompt mode, revealing that its zero-shot generalization falls short of competing with baseline models. Then, the box-prompt mode and point-prompt mode are examined to assess their performance on segmentation. The experiments demonstrate that the box-prompt mode achieves higher mIoU than the other prompt modes (Table 1). For the point-prompt mode, settings with different numbers of points are evaluated (Figure 5), and segmentation performance improves as the number of prompt points increases. We use SAM for interactive segmentation to guide the precise restoration of intraoral videos without additional multi-task optimization, ensuring that the process runs in real time.

Furthermore, unlike previous approaches, this system addresses the task flexibly by organizing cooperation among models, without being constrained by any elaborate module or rigid framework. Notably, the design of this system (Figure 1) enables the pipeline to leverage a diverse range of powerful AIGC techniques (e.g. video super-resolution^36,37 and video interpolation^38,39). Our study has shown that the enhanced digital videos have a degree of clinical applicability.

In fact, IARS is not limited to intraoral videos captured by an IOC; in principle, it is suitable for images and videos captured with any device that need restoration because of factors such as lighting and jitter. In this investigation, we evaluated segmentation, restoration, and processing speed for the proposed system and previous methods. This evaluation assesses only the first two modules (Figure 2(a) and (b)) on the frame images extracted from videos. The combination of the interactive segmentation module and the intraframe restoration module therefore shows strong restoration capability for individual images. Given the flexibility of the system, we may be able to build a more powerful restoration system with better performance for intraoral digital photography in the future.

IARS is mainly aimed at the restoration of intraoral digital videos, with some potential to restore images. One of our ideal clinical applications is to integrate it into an IOC. The first IOC emerged in the late 1980s. Subsequently, iterative refinement and enhancement by multiple manufacturers culminated in sophisticated high-end IOCs.²¹ Presently, the IOC is a small handheld device that is ergonomic, lightweight, comfortable to use, and relatively inexpensive, and it captures images and videos that can be magnified and viewed immediately by the patient and the clinician. Some studies⁵⁴ have highlighted the utility of oral endoscopy as a valuable adjunct, aiding physicians in recording treatment progress, archiving images, and facilitating post-treatment follow-up. While shooting, patients can engage in real-time visualization of their oral status on a screen, fostering effective doctor-patient communication. One of the advantages of the IOC is that it can easily reach deep into the mouth to capture specific areas. Its shortcomings are also obvious: it is easily affected by various factors in the oral cavity, especially poor lighting and possible shaking when handheld, which may reduce the quality of the pictures and videos it captures. If IARS can be integrated into the clinical application of the IOC, it may be possible to improve the quality of the pictures and videos output by the IOC under the doctor’s real-time interactive prompts, optimizing its clinical effect and expanding its clinical practicability. Integrating it into a smartphone camera application is another promising option.

On the whole, this research provides preliminary evidence of the following: first, the large foundation model SAM can perform real-time interactive segmentation of intraoral digital videos; second, the interactive segmentation prior helps guide a more precise restoration process; third, the proposed system is expected to be integrated into digital dental applications, contributing to the prevention, diagnosis, treatment, and even remote treatment of oral diseases.

This study nevertheless has some limitations. First, the limitations of the dataset should be acknowledged. The dataset contains a limited number of videos, and their overall quality is poor, which may affect the segmentation and restoration performance of the system. The videos are also shot from a single angle, so they may not fully simulate real clinical scenarios or test the ability of the system to restore videos taken from other angles. Second, several studies⁵⁵ have shown that SAM may fail in some challenging scenarios, especially when the targets have weak boundaries. This is not surprising, since SAM is trained mainly on natural image datasets in which objects usually have strong edge information.

In future work, the next step is to collect more representative datasets for training and testing the system to further optimize its performance. Moreover, because photographs are the main form of clinical monitoring for doctors, we will also further optimize its image restoration function. Meanwhile, given that IOC and smartphones complement SLR and DOM under certain conditions, the system may be integrated into an IOC program or a smartphone camera application for doctors and patients.

Conclusion

This study presents IARS, a real-time, interactive, and clinically applicable system for the restoration of intraoral digital videos. IARS significantly outperforms previous methods. It can be seamlessly integrated into real-world clinical intraoral cameras, benefiting both patients and dentists. In this investigation, we emphasize the necessity of an interactive segmentation prior in restoration. Furthermore, this study explores the potential of using the deep learning-based large foundation model SAM in digital dentistry.

Supplemental Material

sj-docx-1-dhj-10.1177_20552076241269536 - Supplemental material for A real-time interactive restoration system for intraoral digital videos using segment anything model

Supplemental material, sj-docx-1-dhj-10.1177_20552076241269536 for A real-time interactive restoration system for intraoral digital videos using segment anything model by Yongjia Wu, Li Zeng, Yaya Hong, Xiaojun Li and Xuepeng Chen in DIGITAL HEALTH

Footnotes

Acknowledgements

Not applicable.

Contributorship

The project was conceptualized by Yongjia Wu, Xiaojun Li, and Xuepeng Chen. Yongjia Wu, Li Zeng, and Yaya Hong designed the study. Yongjia Wu acquired and analyzed the data and wrote the original manuscript. Yongjia Wu, Li Zeng, and Yaya Hong prepared the figures and tables. Li Zeng, Yaya Hong, Xiaojun Li, and Xuepeng Chen revised the manuscript. All authors reviewed and edited the manuscript and approved the final version.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Ethical approval

Not applicable.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by National Natural Science Foundation of China, grant number 81400511; Key R&D Program of Zhejiang, grant numbers 2022C03088, 2023C03072; Zhejiang Provincial Natural Science Foundation of China, grant number LY18H140001; R&D Program of the Stomatology Hospital of Zhejiang University School of Medicine, grant numbers RD2022DLYB03, RD2022JCEL04.

Guarantor

XC and XL.

ORCID iD

Yongjia Wu

Supplemental material

Supplemental material for this article is available online.

References

1. Kühnisch, Meyer, Hesenius, et al. Caries detection on intraoral images using artificial intelligence. J Dent Res 2022; 101: 158–165.

2. Di Fede, Panzarella, Buttacavoli, et al. Doctoral: a smartphone-based decision support tool for the early detection of oral potentially malignant disorders. Digit Health 2023; 9: 20552076231177141.

3. Jin, Han, Huang, et al. Automatic three-dimensional nasal and pharyngeal airway subregions identification via vision transformer. J Dent 2023; 136: 104595.

4. Liu, Yang, et al. A survey of artificial intelligence in tongue image for disease diagnosis and syndrome differentiation. Digit Health 2023; 9: 20552076231191044.

5. Shan, Tay. Application of artificial intelligence in dentistry. J Dent Res 2021; 100: 232–244.

6. Signori, Collares, Cumerlato CBF, et al. Validation of assessment of intraoral digital photography for evaluation of dental restorations in clinical research. J Dent 2018; 71: 54–60.

7. Engels, Meyer, Schonewolf, et al. Automated detection of posterior restorations in permanent teeth using artificial intelligence on intraoral photographs. J Dent 2022; 121: 104124.

8. Mosavat, Ahmadi, Amirfarhangi, et al. Evaluation of diagnostic accuracy of CBCT and intraoral radiography for proximal caries detection in the presence of different dental restoration materials. BMC Oral Health 2023; 23: 419.

9. Park, Cho, Kang, et al. Caries detection with tooth surface segmentation on intraoral photographic images using deep learning. BMC Oral Health 2022; 22: 573.

10. Yar. Digital workflows for the management of tooth wear. Br Dent J 2023; 234: 427–431.

11. Opdam NJM, Collares, Hickel, et al. Clinical studies in restorative dentistry: new directions and new demands. Dent Mater 2018; 34: 1–12.

12. Chiu, Lee, Liu, et al. Evaluation of the marginal adaptation and gingival status of full-crown restorations using an intraoral camera. BMC Oral Health 2022; 22: 517.

13. Pan, Westland. Tooth color and whitening – digital technologies. J Dent 2018; 74 Suppl 1: S42–S46.

14. Tatano, Berkels, Ehrlich, et al. Spatial agreement of demineralized areas in quantitative light-induced fluorescence images and digital photographs. Dentomaxillofac Radiol 2018; 47: 20180099.

15. Ong JMD, Crasto, Anwar, et al. A standardized approach to extra-oral and intra-oral digital photography. JoVE 2022; 185: e63627.

16. Fei. Comprehensive review of surgical microscopes: technology development and medical applications. J Biomed Opt 2021; 26: 010901–010901.

17. Chiu, Faraz, Zhang, et al. A study on the application of intraoral camera in the identification of oral anatomical landmarks. J Peking Univ Health Sci 2023; 55: 120–123.

18. Valizadeh-Haghi, Naslseraji, et al. Smartphone photography as a teledentistry method to evaluate anterior composite restorations. Int J Dent 2023; 1: 3171140.

19. Hou. Role of the operating microscope in diagnosis and treatment of endodontic diseases. Chinese J Stomatol 2018; 53: 386–391.

20. Ilhan, Lin, Guneri, et al. Improving oral cancer outcomes with imaging and artificial intelligence. J Dent Res 2020; 99: 241–248.

21. Pentapati, Siddiq. Clinical applications of intraoral camera to increase patient compliance – current perspectives. Clin Cosmet Investig Dent 2019; 11: 267–278.

22. Piedra-Cascón, Adhikari, Özcan, et al. Accuracy assessment (trueness and precision) of a confocal based intraoral scanner under twelve different ambient lighting conditions. J Dent 2023; 134: 104530.

23. Katsaros, Ostrowski, Wlodarczak, et al. Multi-task video enhancement for dental interventions. Lect Notes Comput Sc 2022; 13437: 177–187.

24. Zhong, Gao, Zheng, et al. Efficient Spatio-Temporal Recurrent Neural Network for Video Deblurring. In: 16th European Conference on Computer Vision (ECCV), 2020. Springer: Cham, 2020, pp.191–207.

25. Chu, Huang. Instance-Aware Image Colorization. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. IEEE, 2020, pp.7965–7974.

26. Westland, Luo, et al. Investigation of the perceptual thresholds of tooth whiteness. J Dent 2017; 67s: S11–S14.

27. Gomez-Polo, Martin Casado, Gomez-Polo, et al. Colour thresholds of the gingival chromatic space. J Dent 2020; 103: 103502.

28. Ryu, Kim, et al. Evaluation of artificial intelligence model for crowding categorization and extraction diagnosis using intraoral photographs. Sci Rep 2023; 13: 5177.

29. Gerhardt MDN, Fontenele, Leite, et al. Automated detection and labelling of teeth and small edentulous regions on cone-beam computed tomography using convolutional neural networks. J Dent 2022; 122: 104139.

30. Kirillov, Mintun, Ravi, et al. Segment anything. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. IEEE, 2023, pp.4015–4026.

31. Budd, Robinson, Kainz. A survey on active learning and human-in-the-loop deep learning for medical image analysis. Med Image Anal 2021; 71: 102062.

32. Zhang, et al. A human-in-the-loop deep learning paradigm for synergic visual evaluation in children. Neural Netw 2020; 122: 163–173.

33. Peng, Rousseau, Shortliffe, et al. AI-generated text may have a role in evidence-based medicine. Nat Med 2023; 29: 1593–1594.

34. Cao, Liu, et al. A comprehensive survey of AI-generated content (AIGC): a history of generative AI from GAN to ChatGPT. arXiv preprint arXiv:2303.04226, 2023.

35. Pataranutaporn, Danry, Leong, et al. AI-generated characters for supporting personalized learning and well-being. Nat Mach Intell 2021; 3: 1013–1022.

36. Chan KCK, Wang, et al. BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. IEEE, 2021, pp.4945–4954.

37. Chan KCK, Zhou, et al. BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. IEEE, 2022, pp.5962–5971.

38. Xue, Chen, et al. Video enhancement with task-oriented flow. Int J Comput Vis 2019; 127: 1106–1125.

39. Kalluri, Pathak, Chandraker, et al. FLAVR: Flow-Agnostic Video Representations for Fast Frame Interpolation. In: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023. IEEE, 2023, pp.2070–2081.

40. A real-time interactive restoration system for intraoral digital videos using segment anything model. https://yogurtsam.github.io/ive (accessed 2023).

41. Katsaros, Kopa Ostrowski, Jezierska, et al. Vident-lab: a dataset for multi-task video processing of phantom dental scenes. In: Gdańsk University of Technology, 2022. https://doi.org/10.34808/1jby-ay90.

42. Zhang, Zhu, Isola, et al. Real-time user-guided image colorization with learned deep priors. ACM T Graphic 2017; 36: 119.

43. Multimodal Advanced, Generative, and Intelligent Creation Toolbox. https://github.com/open-mmlab/mmagic (accessed 16 October 2023).

44. Gkioxari, Dollár, et al. Mask R-CNN. IEEE Trans Pattern Anal Mach Intell 2020; 42: 386–397.

45. Choi, Kim, et al. Mask R-CNN based multiclass segmentation model for endotracheal intubation using video laryngoscope. Digit Health 2023; 9: 20552076231211547.

46. Wang. Segment anything in medical images. Nat Commun 2024; 15: 654.

47. Zhou, Rahman Siddiquee, Tajbakhsh, et al. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In: 2018 Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support (DLMIA), 2018. Springer: Cham, 2018, pp.3–11.

48. Cho, Hong, et al. Rethinking Coarse-to-Fine Approach in Single Image Deblurring. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021. IEEE, 2021, pp.4621–4630.

49. Kingma. Adam: A Method for Stochastic Optimization. In: 3rd International Conference on Learning Representations (ICLR), 2015.

50. Selvaraju, Cogswell, Das, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int J Comput Vis 2020; 128: 336–359.

51. Zhou, Jiang, Cheng, et al. Detecting representative characteristics of different genders using intraoral photographs: a deep learning model with interpretation of gradient-weighted class activation mapping. BMC Oral Health 2023; 23: 327.

52. Chen L-C, Zhu, Papandreou, et al. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In: 15th European Conference on Computer Vision (ECCV), 2018. Springer: Cham, 2018, pp.801–818.

53. Guo, Chen, Mallineni, et al. Feasibility of oral health evaluation by intraoral digital photography: a pilot study. J Int Med Res 2021; 49: 300060520982841.

54. Ferrazzano, Orlando, Cantile, et al. An experimental in vivo procedure for the standardised assessment of sealants retention over time. Eur J Paediatr Dent 2016; 17: 176–180.

55. Chen, Zhu, Ding, et al. SAM-Adapter: Adapting Segment Anything in Underperformed Scenes. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. IEEE, 2023, pp.3367–3375.
