Abstract
Background
The diagnosis of otitis media with effusion (OME) requires substantial training and experience in otoscopic examination of children.
Objective
This study developed an artificial intelligence (AI) model to predict OME diagnosis in children.
Methods
The source data were images of pediatric patients’ tympanic membranes obtained by otoendoscopy. A convolutional neural network was used for machine learning. The diagnostic features of the tympanic membrane, as labelled by the experts, and the surgical findings served as the ground truth. The final model was built with the InceptionV4 architecture. The model was trained using the Adaptive Moment Estimation optimizer with an initial learning rate of 0.0001, a batch size of 32, and a maximum of 100 epochs. The Categorical Cross-Entropy loss function was employed for the internal validation. The outcome was the distinction between OME and normal tympanic membrane. A confusion matrix was used to assess the model’s performance. The model was tested for agreement with otolaryngologists and implemented as a web application.
Results
The initial sample size was 320 pictures. For OME, the model achieved an accuracy of 94.7% (95% CI 0.88, 1). The F1 score was 96% (95% CI 0.89, 1), and the area under the receiver operating characteristic curve was 0.98 (95% CI 0.93, 1). The kappa agreement between AI and experienced otolaryngologists was 0.627 (p < 0.001).
Conclusion
An AI diagnostic model for otitis media with effusion had good accuracy and moderate agreement with otolaryngologists. The model should be helpful for preliminary diagnosis, telemedicine, or educational purposes.
Keywords
Introduction
Otitis media (OM) is a common problem encountered by pediatricians as well as pediatric otolaryngologists. The clinical practice guideline recommends pneumatic otoscopy for the primary diagnosis of OM. 1 Despite the good quality of modern otoscopes, diagnosing OM in infants and children remains problematic. Performing otoscopy in infants or young children can be difficult for trainees or inexperienced physicians because of earwax and a small ear canal. Children with acute otitis media (AOM) are usually in pain, so they often do not cooperate with the ear examination. Otitis media with effusion (OME) is the most problematic type of OM because the disease is asymptomatic, so the onset is not clear. If the patient is not a child at risk who is under regular surveillance for ear diseases, an early diagnosis may be missed. Problems in the diagnosis of otitis media in children may lead to inadequate treatment, complications, permanent damage to the middle ear, or hearing loss.
Certain characteristic features of the tympanic membrane (TM) are important for the diagnosis of OM. These features include color, transparency, mobility, middle ear fluid, and position or retraction of the TM. Over the long course of OM, the appearance of the TM changes depending on the degree of inflammation, the type of middle ear fluid, and the negative pressure in the middle ear. The findings of the TM in OME vary widely, which makes diagnosis more difficult. The interpretation of otoscopic findings in OME needs substantial training and a certain period of practice. A previous study 2 showed that both pediatricians and otolaryngologists had difficulty making the otoscopic diagnosis, and that their performance improved after attending otoscopic teaching sessions.
Video oto-endoscopy is useful for obtaining pictures of the TM for diagnosis, serving educational purposes, and improving the diagnostic skills of residents and medical students. 3 The images obtained from otoscopic examination can be used in the development of artificial intelligence (AI) to help in the diagnosis of otitis media and other ear diseases in children. Such AI has been developed from computer vision, using deep learning algorithms to analyze the otoscopic images and form a diagnostic or classification model. 4 Among the classification methods, the diagnostic model using a convolutional neural network (CNN) achieves the greatest accuracy. 5
At present, no AI prediction model is universally used as a gold standard for the diagnosis of OME in children. Some previous AI models were developed from images of both adults and children with otitis media and other ear diseases. 6,7 The ground truth for the AI in the previous meta-analysis was the diagnosis by otolaryngologists and pediatricians. 5 This study focused on OME because of its high prevalence and the variations of the TM findings. The input for the training of the AI should include all characteristic features of the TM to achieve the most accurate diagnosis. In our previous study, 8 transparency and retraction of the TM were the important features because they had a significant association with conductive hearing loss in children with OME. The objective of this study was to develop an AI model to predict the diagnosis of OME in children. The AI was intended to be a preliminary diagnostic tool for trainees, general otolaryngologists, pediatricians, or physicians with less experience in diagnosing OME in children.
Methods
We conducted a diagnostic study by developing an AI model for the diagnosis of OME in children, using machine learning with CNN. The diagnosis from the experienced pediatric otolaryngologists, combined with the surgical findings at myringotomy, was the gold standard for the AI training. Still pictures and video of the TM were taken during the ear examinations of children who attended the pediatric otolaryngology clinic at a university hospital, which is a tertiary care center, between 2023 and 2024. The inclusion criteria were children less than 15 years old who presented to the clinic with ear diseases, hearing loss, delayed speech, and/or delayed development. We also included children at risk who were referred from other services for surveillance of OM or hearing loss, such as children with cleft palate, craniofacial anomalies, or syndromic or genetic diseases. The exclusion criteria were otitis externa, foreign body, previous ear surgery, or cholesteatoma. Written informed consent from the parents or legally authorized representatives of the participants was obtained prior to the study. Video-otoendoscopy in children with ear diseases is routine practice at the institute for the purpose of clinical diagnosis and patient education according to the standard of care. The instrument used for obtaining the pictures was a zero-degree, 3-mm tele-otoscope (Karl Storz, Tuttlingen, Germany) inserted in the ear speculum with pneumatic application and a video recording system (Telecam C3, Karl Storz, Tuttlingen, Germany). Still pictures were captured by the telecam, while selected frames from the video were converted to still images with the ACDSee Pro10 program. The original images were JPG files of 1920x1080 pixels. The pictures and videos of the TM were stored in the database under the research number of each case without any identifying information. The data were used for the development of artificial intelligence (AI) through machine learning.
This study received ethical approval from the Institutional Review Board of the affiliation of the first author (approval number Si733/2022) on October 25, 2022. The research was conducted in accordance with the Code of Ethics of the World Medical Association (Helsinki Declaration).
Preparation of the data started with the screening of the pictures. Blurry pictures, pictures with an incomplete TM, and pictures with ear wax obscuring the TM were excluded. Black edges of the pictures from oto-endoscopy were cropped out. Two pediatric otolaryngologists, with 30 years and 15 years of experience respectively, interpreted the findings of the pneumatic oto-endoscopy. The pictures of the TM were classified by the characteristic features and the diagnosis. The features consisted of color, transparency, mobility, middle ear fluid, position/retraction, and perforation. The presence of middle ear fluid was confirmed at myringotomy with tympanostomy tube insertion, which was done according to the guidelines for tympanostomy tube insertion in children. 9 The accuracy of the experts in the diagnosis of OME was verified against the surgical findings, which were obtained under the operating microscope. The operative findings of the tympanic membrane and middle ear were recorded by surgeons who were not involved with the study.
The diagnoses of the ear examination were given as normal TM or healed TM, acute otitis media (AOM), otitis media with effusion (OME) and TM perforation. Normal TM was defined as the TM with clear, pearly-gray transparent surface, freely-mobile with pneumatic application, with no middle ear fluid and no retraction. Normal TM was confirmed with normal hearing level at 20 decibels or less with no air-bone gap from pure tone audiogram. In small children, the same hearing level could be obtained from brain stem response audiometry (ABR) or auditory steady state response audiometry (ASSR). Healed TM was defined as TM after the recovery of OM with stigmata such as scar, tympanosclerosis or monomeric membrane without middle ear effusion or active diseases. Acute otitis media (AOM) was defined by bulging of the TM, with middle ear effusion and signs of mucosal inflammation. Otitis media with effusion (OME) was defined by the presence of middle ear effusion, with or without retraction of the TM, and the absence of signs of acute inflammation of the TM or the external ear canal. Retraction of the TM was defined by the attachment of the TM to any structures of the middle ear other than the bony annulus, the lateral process of malleus and the umbo.
Agreement of the interpretation of the features and the diagnosis was assessed by kappa statistics. The authors decided the diagnosis and the classification of each feature of the TM before the annotation of the pictures. Discrepancies were resolved by discussion and consensus. After the classification, the characteristic features on each picture of the TM were annotated with Google Drawings. Color codes were given for the area of annotation of each feature: green for the boundary of the TM, red for the area with classified transparency, yellow for the presence of middle ear effusion, blue for retraction, and purple for perforation. Mobility of the eardrum was not used as a training feature in this phase because of the difficulty of annotation. The annotated pictures were transferred with the original copies for the training of the AI model. The pictures were resized to 224x224 and 512x512 pixels. Data augmentation was performed to give a thorough view of the TM from different angles and to increase the sample size. Augmentation reflects the real-world otoscopic variability when the picture is taken from ear canals of different sizes and shapes. There are also variations in the size and axis of the TM according to the age of the child. Augmentation of the TM consisted of changing the size and shape of the TM by rotation, shifting of height and width, vertical and horizontal flip, shearing, or zooming of the pictures. Augmentation of the appearance of the pictures was also performed by shifting the colors, changing the level of the middle ear fluid, and adjusting the saturation, contrast, and brightness of the pictures. The data were divided into 70% for the training set, 15% for the validation set, and 15% for the test set. The pictures in each set came from different groups of patients. The augmentation was performed after the splitting of the data to prevent data leakage. All augmented images derived from the original images of each group were confined to the same data split throughout the study.
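The splitting step described above can be illustrated with a minimal sketch. It assumes a hypothetical mapping from image file name to patient identifier and assigns whole patients, rather than individual images, to the 70/15/15 sets; the function name, the seed, and the example data are illustrative assumptions and are not taken from the study's code. Augmentation would then be applied within each set only.

```python
import random
from collections import defaultdict


def split_by_patient(image_to_patient, fractions=(0.70, 0.15, 0.15), seed=0):
    """Assign whole patients to train/val/test so that no patient spans two sets."""
    patients = sorted(set(image_to_patient.values()))
    rng = random.Random(seed)  # fixed seed (assumption) for a reproducible split
    rng.shuffle(patients)

    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }

    # Every image follows its patient into exactly one set.
    split = defaultdict(list)
    for image, patient in sorted(image_to_patient.items()):
        for name, members in groups.items():
            if patient in members:
                split[name].append(image)
    return dict(split), groups


# Hypothetical data: 20 patients with 2 images each.
images = {
    f"img_{p:02d}_{k}.jpg": f"patient_{p:02d}"
    for p in range(20)
    for k in range(2)
}
split, groups = split_by_patient(images)
```

Because the split is made on patients before any augmented copy exists, every augmented image inherits the set of its original picture, which is the property the leakage safeguard relies on.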
The data of the training set were the input for the different architectures of convolutional neural network (CNN) from which the prediction models were built. The final prediction models were designed using two methods. The first method was to define the CNN architecture that could predict each characteristic feature of the TM. The architectures that gave the best accuracy for each feature were combined to give the diagnosis using a decision tree. The second method was to train the machine using the image data of the TM solely with the given diagnosis. The model relied on the ability of the CNN to learn the visual pattern by extracting the features and building the matrix through pooling and convolutional layers to predict the diagnosis. The confusion matrix was used to calculate the diagnostic parameters and F1 score of the AI model. The hardware used for deep learning with the CNN consisted of an NVIDIA A100 GPU (40 GB), 52 GB of RAM, an Intel Xeon CPU at 2.30 GHz, and 147 GB of storage.
The AI model for the diagnosis of otoscopic findings was deployed in the form of a web application. Another unknown test set was used to validate the AI model against the diagnosis given by two other pediatric otolaryngologists, each with 15 years of experience, who were blinded to the history and previous physical examination of the cases. Agreement between the diagnosis by the AI model and the pediatric otolaryngologists was analyzed. The report of the study followed the checklist of the TRIPOD+AI statement of the EQUATOR initiative 10 (Supplemental Tables 1 and 2).
Statistical analysis
Descriptive statistics were used for the categorization of oto-endoscopic findings, reported as number and percent. The confusion matrix was used for the calculation of the sensitivity, specificity, accuracy, and area under the receiver operating characteristic curve. The F1 score was calculated as TP / (TP + ½(FP + FN)), where TP, FP, and FN are the numbers of true positives, false positives, and false negatives. The 95% CI was calculated with the exact binomial method. Agreement between the AI and the pediatric otolaryngologists was calculated by the kappa statistic. The statistical calculation was done using Microsoft Excel and SPSS version 22.0.
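For readers unfamiliar with the kappa statistic, the following is a minimal sketch of Cohen's kappa for two raters, written in plain Python. It is an illustration only: the study computed its statistics in Excel and SPSS, and the function name and the ten example labels below are hypothetical, not study data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same, non-empty set of cases")
    n = len(rater_a)

    # Observed agreement: fraction of cases given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for 10 cases (not data from this study).
ai_labels = ["OME"] * 6 + ["normal"] * 4
expert_labels = ["OME"] * 5 + ["normal"] * 5
kappa = cohens_kappa(ai_labels, expert_labels)
```

In this hypothetical example the raters agree on 9 of 10 cases and the chance agreement is 0.5, so kappa is (0.9 - 0.5) / (1 - 0.5) = 0.8.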
Results
Oto-endoscopic findings.
TM = tympanic membrane, OME=otitis media with effusion, AOM=acute otitis media.

Picture annotation examples of pre-annotation and post-annotation pictures of the tympanic membrane. Annotation was done with color codes. Green is the boundary of the tympanic membrane, red is the area of specified transparency (or tympanosclerosis), blue is the area of retraction, yellow is the area of fluid which filled the middle ear. (a) (b) (c) otitis media with effusion, pre-annotation (d) (e) (f) otitis media with effusion, post-annotation.

Data augmentation. An example of data augmentation of one picture of the tympanic membrane. Modification of the picture was done by rotation, shifting in height and width, shearing, horizontal and vertical flip and zooming. Augmentation reflects the real-world otoscopic variability when the picture is taken from ear canals of different sizes and shapes. There are also variations in the size and axis of the TM according to the age of the child. Parameters: rotation_range = 20, height_shift_range = 0.15, width_shift_range = 0.15, shear_range = 0.9, horizontal_flip = true, vertical_flip = true, zoom_range = 0.3, fill_mode = nearest.
Prediction model of the characteristic features.
Architecture of the convolutional neural network used in building the artificial intelligence model with best accuracy for the prediction of each characteristic feature and for the diagnosis.
As OME was the most prevalent diagnosis among the TMs with disease, another model was built with InceptionV4 to differentiate between normal TM and OME. An InceptionV4 architecture pretrained on ImageNet was employed for binary classification of Normal and OME cases. This architecture was selected due to its ability to capture multi-scale features through parallel convolutional pathways, which is advantageous for modeling complex and heterogeneous patterns in otoscopic images. The pictures of normal or healed TM and OME were augmented to give a five-fold increase of the sample size. The InceptionV4 model was trained by extracting the characteristic features of the TM and building the feature matrix of each new picture. Representative pixels of each zone (pooling) were arranged into feature vectors for the training of the model. Weight adjustment was used to classify the picture by the probability of being OME or normal TM. The training parameters of InceptionV4 are shown together with those of the first model in Supplemental Table 3. The model was trained using the Adaptive Moment Estimation (Adam) optimizer, selected for its computational efficiency and its ability to dynamically adapt the learning rate for individual parameters. The initial learning rate was empirically set to 0.0001 to favor stable parameter updates. The training process used a batch size of 32, balancing hardware memory constraints against the need for accurate gradient estimates per iteration. The objective was to distinguish between two clinical classes (“Normal” and “OME”), and the Categorical Cross-Entropy loss function was employed. Training was set to run for a maximum of 100 epochs, with termination governed by an early stopping mechanism that monitored the validation loss with a patience of 10 epochs.
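The early stopping rule can be sketched in a few lines of plain Python. This is a framework-independent illustration under stated assumptions, not the study's training code: the `TRAINING_CONFIG` dictionary only collects the hyperparameters reported in the text, while the function name and the validation-loss curve are hypothetical.

```python
# Hyperparameters as reported in the text; the dictionary itself is illustrative.
TRAINING_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "batch_size": 32,
    "max_epochs": 100,
    "patience": 10,
}


def early_stopping_epoch(val_losses, patience=10, max_epochs=100):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs, or at `max_epochs`.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return min(len(val_losses), max_epochs), best_epoch


# Hypothetical validation-loss curve: improves for 5 epochs, then plateaus.
losses = [1.0, 0.8, 0.6, 0.5, 0.4] + [0.45] * 20
stop_epoch, best_epoch = early_stopping_epoch(
    losses,
    patience=TRAINING_CONFIG["patience"],
    max_epochs=TRAINING_CONFIG["max_epochs"],
)
```

With this hypothetical curve the best loss occurs at epoch 5 and training stops at epoch 15, after ten epochs without improvement. In practice the weights from the best epoch are usually the ones retained, although the text does not state which checkpoint was kept.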
Confusion matrix of the test set for the diagnosis between normal tympanic membrane and otitis media with effusion.
Confusion matrix of the test set of 38 cases, predicted by the model built by Inception V4.
Sensitivity = 1.0 (95% CI 0.85, 1.0)
Specificity = 0.87 (95% CI 0.64, 0.96)
Accuracy = 0.947 (95% CI 0.88, 1.0)
F1 score = 0.96 (95% CI 0.89, 1.0)
Area under the receiver operating characteristic curve (AUC) = 0.98 (95% CI 0.93, 1.0)
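As a worked check of how these point estimates follow from a confusion matrix, the sketch below uses one set of cell counts for 38 cases that is consistent with the reported values. The counts (23 true positives, 0 false negatives, 2 false positives, 13 true negatives) are an inference from the reported figures and an assumption of this example; they are not copied from the study's table.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Point estimates derived from a 2x2 confusion matrix (OME = positive class)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        # F1 in the form stated under Statistical analysis: TP / (TP + 1/2 (FP + FN))
        "f1": tp / (tp + 0.5 * (fp + fn)),
    }


# Assumed counts, chosen only because they reproduce the reported estimates.
metrics = confusion_metrics(tp=23, fn=0, fp=2, tn=13)
rounded = {name: round(value, 3) for name, value in metrics.items()}
```

With these assumed counts, the sensitivity is 23/23 = 1.0, the specificity is 13/15 ≈ 0.867, the accuracy is 36/38 ≈ 0.947, and the F1 score is 23/24 ≈ 0.958. The AUC cannot be recovered from a single confusion matrix because it depends on the predicted probabilities across all thresholds.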
Model predictions were generated by passing input images through the trained InceptionV4 network implemented in PyTorch, which outputs a probability score representing the likelihood of the image belonging to the OME class. The final layer employs a sigmoid activation function, producing a probability value p ∈ [0, 1]. For classification, the predicted probability was compared against a predefined decision threshold. In this study, a default threshold of 0.8 was used, such that predictions with p ≥ 0.8 were classified as OME, while those with p < 0.8 were classified as normal TM. Formally, the prediction rule can be expressed as: ŷ = 1 if p ≥ τ, and 0 otherwise, where p denotes the predicted probability of OME and τ is the decision threshold.
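The decision rule above can be written as a short function. This is a minimal sketch: the function name and the returned label strings are illustrative assumptions, and the check on the threshold reflects the 50% to 100% range that the web application is described as allowing.

```python
def classify(p_ome, threshold=0.8):
    """Apply the rule y = 1 if p >= tau else 0, returning a readable label."""
    if not 0.0 <= p_ome <= 1.0:
        raise ValueError("p_ome must be a probability in [0, 1]")
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0.5, 1.0]")
    return "OME" if p_ome >= threshold else "normal TM"


# Hypothetical probabilities, for illustration only.
examples = [classify(0.93), classify(0.79), classify(0.80), classify(0.60, threshold=0.5)]
```

Raising the threshold trades sensitivity for specificity: at 0.8, an image with a predicted OME probability of 0.79 is reported as normal TM, whereas the same image would be reported as OME at a threshold of 0.5.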
The model was deployed as part of a web-based application, where inference was performed via a FastAPI backend integrated with a Next.js frontend. Users upload a single otoscopic image and specify a desired confidence threshold, which is passed to the backend along with the image. The backend processes the image, performs model inference, and returns the predicted probability along with the final classification result based on the specified threshold. To support flexible deployment in clinical settings, the decision threshold for classification was designed to be user-adjustable within the range of 50% to 100%. The web application and the prediction of OME or normal TM are shown in Figure 3.
Web application of the artificial intelligence (AI) for the diagnosis of oto-endoscopic examination using the InceptionV4 architecture. The diagnosis of the oto-endoscopic examination was obtained by uploading the picture of the tympanic membrane and pressing the “analyze” button. The model predictions were calculated by the application programming interface (API). The decision threshold for classification was within the range of 50% to 100%. The results will be shown with the confidence threshold above 80%. The AI will report the probability of the diagnosis between otitis media with effusion and normal tympanic membrane. (a) Otitis media with effusion, the API score was 99.81%. (b) Normal tympanic membrane, the API score was 100%.
Two pediatric otolaryngologists with 15 years of experience, who were not involved in the AI development, validated the AI with 20 unknown cases. The average prediction performance of the two physicians was 75%, while the AI predicted 85% of the cases correctly. The overall agreement (kappa) between the experienced pediatric otolaryngologists and the AI was 0.627 (p < 0.001).
Discussion
Otitis media with effusion was the most prevalent type of OM in our study, because most of our patients were “children at risk” 1 who came for regular follow-up for the surveillance of ear diseases. Detection of OME in children at risk is important to prevent conductive hearing loss and learning disabilities. 1 In the study of Zeng et al., 6 deep learning predicted conductive hearing loss from the picture of OME with an accuracy of 81%. We developed an AI to interpret the findings of oto-endoscopic examination in children, focusing on OME. The method of machine training was supervised learning using the labelled data as the input. The technique of the training was CNN, which is commonly used for image recognition tasks. 11 Oto-endoscopy gives high-quality images of the TM with clear presentation of the characteristic features. From the previous meta-analysis, 12 the machine trained by the images from endoscopy had a greater area under the curve (AUC) than the otoscopic images. We also labelled the boundary of the TM, separating them from the surrounding external ear canal, for the precision of the data.
The diagnosis of OME is more subtle than that of AOM because of the variations in the TM characteristics. In our study, labelling was done by experienced pediatric otolaryngologists. The presence of middle ear fluid was confirmed at myringotomy as the “ground truth” of the diagnosis. Labelling of each feature seen on the TM designated the region that was important for the prediction. The first model predicted each characteristic feature before reaching the diagnosis through the decision tree. In Table 2, some models show high validation accuracy but lower test accuracy. Among the TM characteristic features, the gap between the validation and test accuracy was greatest for the prediction of the color of the TM. Overfitting might be the cause of the gap, but the color of the TM was also a feature with great variability. The unseen test set might have contained many unique color features that differed from those on which the model had been trained. In the study of Zeng et al., 6 the characteristic feature that had a significant effect on the diagnosis of OME was the presence of middle ear fluid, while the color of the TM had no effect on the identified hearing loss. 6 The other characteristic features in our study had a gap of less than 10%. During the usual validation process, the tuning of the hyperparameters is repeated to achieve the best performance, while the accuracy of the test set is the result of a one-time prediction. For the diagnosis, the model developed from InceptionV4 had a gap of only 2.3% between the validation and test accuracy.
In the second model developed by Inception V4, the CNN performed the visual analysis by scanning through the picture for the difference in texture, edges, or color gradient, extracting the prominent features, and gathering details for the diagnosis. Data augmentation and transfer learning on the pre-trained CNN model, such as Inception V4, helped to improve the accuracy and reliability of the AI model. 12 The AI in our study could differentiate OME from normal TM with an accuracy of 94.7% and an AUC of 0.98. The meta-analysis by Habib et al. 5 showed the pooled accuracy of AI to classify ear diseases at 90.7%, using the source data from otoscopy and online sources from Google images. In the study of Crowson et al., 13 the AI could differentiate the middle ear effusion as present/absent with the accuracy of 83.8% and AUC of 0.93 by using exclusively intra-operative data for the training. The sensitivity and specificity of the AI in our study corresponded well with the pooled results of the meta-analysis of machine learning in diagnosing middle ear disorders using TM images by Cao et al. 12
The AI in our study had a moderate agreement with two experienced pediatric otolaryngologists in the diagnosis of an unknown test set. In the study of Shim et al., 14 the trained CNN had better diagnostic accuracy than the performance of seven otolaryngologists. Crowson et al. 15 reported the human versus machine validation of a deep learning algorithm for pediatric middle ear infection and found the average prediction accuracy from many subspecialties of physicians at 65%, while AI achieved the accuracy of 95.5%.
The AI in our study was displayed in the form of a web-based tool, which is easy to use on a personal computer. The pictures from the oto-endoscope can be stored in the portable hard drive and uploaded into the AI in the outpatient setting. Zhong et al. 16 developed the Pediatric Otitis Media Classifier, which is an automated classifier for AOM and OME to be used on the local computer with an accuracy of approximately 98%. In our study, there was no difference between the development data and the data used to evaluate the model performance. The pictures of the TM used in the data development were taken from pediatric patients from all regions of the country as our institute is a tertiary care center. The web-based tool can be useful as a telemedicine consultation where the pediatric otolaryngologist or otologist is not available.
The limitation of our study was the small sample size. Besides blurring and ear wax, many pictures were excluded because the whole TM was not captured in the frame. We selected only pictures of the complete TM with good visibility for the training of the AI. The main requirement for the user is that the uploaded picture be complete and clear to achieve the correct prediction. The user should be familiar with the use of a telescope or digital otoscope, both of which are widely available. Cha et al. 17 included pictures with partial visibility of the TM to train the AI, as in the real clinical situation. The final output of the AI in our study did not include AOM or TM perforation because the training sample size was too small, which might affect accuracy when applied to unknown samples. The model was focused on OME, which is more difficult to diagnose, but the clinical applicability may be narrow, particularly in the primary care setting. However, the diagnosis of AOM needs a history of the symptoms and signs of the patient. Noda et al. 18 integrated the patient’s data with otoscopic images to develop a multimodal AI using GPT-4 vision, combining language processing with image analysis. On the other hand, OME is mostly asymptomatic. The usual symptoms noticed by the parents are hearing loss or delayed speech, although these symptoms do not occur in every case. The diagnosis of OME can be made mostly from the otoscopic findings. Because of the small sample size for external testing, the external validation with experienced otolaryngologists in our study is preliminary. Prospective validation in the form of a multi-center study is needed to confirm the accuracy and usefulness of the AI model. Future research should incorporate the mobility of the eardrum in the diagnosis of OME, which was not included in the training of the AI in this study.
Conclusion
An AI model for the diagnosis of OME was made from the CNN trained with digital image processing of the pictures taken from oto-endoscopy. The final model was made from the architecture of InceptionV4 and was applied as a web-based tool. The AI model was able to differentiate OME from otherwise normal TM with good accuracy and moderate agreement with experienced pediatric otolaryngologists. The AI should be helpful in giving the preliminary diagnosis, being a diagnostic aid in telemedicine, or being used as self-training lessons for educational purposes. The recognition of OME, especially in children at risk, should help provide the timely management or surgical intervention to prevent hearing loss and irreversible damage to the TM and middle ear structures.
Supplemental material
Supplemental material - An artificial intelligence model for the diagnosis of otitis media with effusion in children
Supplemental material for An artificial intelligence model for the diagnosis of otitis media with effusion in children by Kitirat Ungkanont, Akadej Udomchaiporn, Nopavit Sriphoonga, Thanakrit Wannarong, Thaweewat Rugsujrit, Tachasit Chueprasert, Archwin Tanphaichitr, Vannipa Vathanophas in DIGITAL HEALTH
Footnotes
Acknowledgements
The authors would like to express our appreciation to Miss Jeerapa Kerdnopakun and Miss Sanyaluck Wattanachalermyos for the technical assistance in the preparation of the manuscript.
Ethical considerations
This study received ethical approval from the Institutional Review Board of the affiliation of the first author (approval number Si733/2022) on October 25, 2022. The research was conducted in accordance with the Code of Ethics of the World Medical Association (Helsinki Declaration).
Consent to participate
Written informed consent from the parents or legally authorized representatives of the participants was obtained prior to the study.
Author contributions
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declare no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. The authors did not use any AI tools in the development or editing of the manuscript.
Data Availability Statement
The study data (otoendoscopic images) cannot be made publicly available as they were patient-related information obtained by informed consent for the study under the institutional IRB approval only. The trained model weights and application programming interface are currently proprietary and not publicly available.
Supplemental material
Supplemental material for this article is available online.
References
