Abstract
Otitis media refers to inflammation of the middle ear. Diagnosing ear disease requires visual examination of the middle-ear tympanic membrane with a standard otoscope, and experts evaluate the membrane visually as part of their clinical examination. Visual inspection, however, is prone to human error and suffers from inter-observer variability. Accurate diagnosis is especially challenging for non-specialists and general practitioners, since otoscopy depends heavily on the observer's expertise. Our proposed approach uses a hybrid Inception-ViT model that incorporates multi-scale, fine-grained features into a vision transformer, and we evaluate its performance against several convolutional networks. This study presents a versatile, comprehensive architecture that automatically extracts fine-grained feature patches with an inception module, followed by a vision transformer encoder that classifies the tympanic membrane. A total of 956 otoscope images were used to train the vision-transformer-based neural network to classify the tympanic membrane into five categories covering the most common ear diagnoses (Normal, Acute Otitis Media, Chronic Suppurative Otitis Media, Earwax, Other). To address the limited dataset size, we apply image augmentations such as flips and rotations, and we use a weighted cross-entropy loss to mitigate class imbalance. Our proposed approach achieves an improved accuracy of 83.05% for tympanic membrane classification. This methodology could be incorporated into future clinical otological decision support systems to improve physicians' diagnostic accuracy and reduce the overall incidence of misdiagnosis. The final results are comparable to those of leading clinical specialists and provide operator-independent diagnosis of otitis media with higher accuracy.
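As context for the class-imbalance handling mentioned above, the following is a minimal NumPy sketch of a weighted cross-entropy loss, where rarer classes receive larger weights. The class counts and inputs below are illustrative assumptions, not values from the study:

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted cross-entropy: each sample's negative log-likelihood
    is scaled by the weight of its true class, so errors on rare
    classes contribute more to the loss."""
    # Numerically stable softmax over class logits
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    # Per-sample negative log-likelihood of the true class
    nll = -np.log(probs[np.arange(len(labels)), labels])
    # Weighted mean, normalized by the total weight
    w = class_weights[labels]
    return (w * nll).sum() / w.sum()

# Five classes as in the paper: Normal, AOM, CSOM, Earwax, Other.
# Hypothetical per-class counts; inverse-frequency weighting.
counts = np.array([500, 180, 120, 100, 56])
weights = counts.sum() / (len(counts) * counts)

logits = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 2.0, 0.1, 0.1]])
labels = np.array([0, 2])
loss = weighted_cross_entropy(logits, labels, weights)
```

In practice a deep-learning framework's built-in weighted cross-entropy (e.g. a per-class weight argument to the loss function) implements the same idea; the sketch only makes the weighting explicit.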
