Combining Gabor-local and contextual-global deep features for cholesteatoma classification

Abstract

A key challenge in classifying cholesteatoma images lies in the effective extraction and integration of both local and global features. This paper addresses this challenge by introducing a Gabor-local and contextual-global network. The network enhances local feature extraction by combining Gabor filtering with a ResNet architecture, which captures richer local information at varying scales and orientations. An attention mechanism then generates an importance score map that is normalized into a weight matrix, allowing the network to focus dynamically on the most relevant regions of the input image; together with Generalized-Mean (GeM) pooling, this yields a more robust global descriptor. Finally, a novel dynamic feature fusion strategy combines local features, global features, and outputs from the last residual block according to their computed importance scores. On a CT dataset of 562 images, the network reaches an accuracy of 90.85% and an AUC of 0.9021, exceeding all compared models on accuracy, sensitivity, specificity, and AUC.

Keywords

cholesteatoma, local feature, global feature, dynamic feature fusion

1. Introduction

Accurate differentiation between cholesteatoma and a normal middle ear (Figure 1) is crucial, as cholesteatoma may cause progressive hearing loss, tinnitus, and structural damage,^1,2 often necessitating surgical intervention. In contrast, a normal middle ear does not require treatment, highlighting the importance of precise diagnosis. Advanced imaging techniques, in combination with artificial intelligence (AI), play a pivotal role in this process. While imaging can reveal subtle anatomical differences, AI further enhances diagnostic accuracy by analyzing complex data patterns, thereby improving classification performance and supporting personalized treatment planning.

Figure 1.

Image samples of normal middle ear and cholesteatoma.

Early efforts in AI-based ear image classification employed deep learning on two-dimensional Computed Tomography (CT) slices to detect chronic otitis media³ and applied machine learning to tympanic membrane images for middle ear condition classification.⁴ Subsequent CT approaches focused on cholesteatoma differentiation and pediatric temporal bone disease diagnosis using architectures such as VGG and hybrid models.^5,6 Structure-aware methods also improved classification accuracy.⁷ In addition to CT, convolutional neural networks (CNNs) have been applied to intraoperative endoscopic and standard otoscopic images for cholesteatoma detection,^8,9 while lightweight models such as MobileNetV2 support targeted predictive tasks.¹⁰ Further methodological advances include deep feature fusion with graph isomorphism networks, explainable three-dimensional CNNs, and hybrid strategies that combine CNNs with machine learning.^11–13 These developments include comparative evaluations of architectures such as ResNet, MobileNetV2 and DenseNet against Magnetic Resonance Imaging (MRI),^14,15 as well as systematic reviews of performance.¹⁶ Bibliometric analyses indicate a growing integration of AI and medical imaging in cholesteatoma research.¹⁷ Additionally, architectures such as DenseNet,¹⁸ SE-ResNet,¹⁹ Vision Transformer (ViT),²⁰ Bottleneck Transformer (BoTNet),²¹ ConvNeXt,²² iFormer,²³ TransNeXt,²⁴ and Visual Attention Network (VAN)²⁵ have used network design or attention mechanisms to achieve state-of-the-art results in multiple medical image classification tasks.

Previous methods in cholesteatoma image classification have achieved a certain degree of success; however, they still face challenges, particularly in effectively extracting and integrating local and global features. These approaches struggle to capture intricate spatial relationships and may overlook important contextual information, which can lead to suboptimal performance. To address these challenges, this paper introduces several components. First, the proposed Gabor-local feature extraction, combined with a ResNet architecture, enhances local feature capture by obtaining richer information at varying scales. Second, a contextual-global feature aggregation is developed. It generates an importance score map that enables the model to focus dynamically on relevant image regions, normalizes the scores into a weight matrix, and uses Generalized-Mean (GeM) pooling to build a robust global descriptor that incorporates contextual significance. Lastly, a novel dynamic feature fusion strategy combines local features, global features, and outputs from the last residual block based on their computed importance scores. This combination not only amplifies critical information but also strengthens the model’s capacity for multi-scale analysis, leading to improved classification performance in challenging cholesteatoma image analysis. Overall, the main contributions of this paper can be summarized as follows:

- The proposed Gabor-local feature extraction within the network enhances local feature capture, enabling a more nuanced representation of relevant patterns in cholesteatoma images.

- The developed contextual-global aggregation mechanism generates the importance score map, which enables the model to dynamically focus on relevant regions and supports robust global descriptors through GeM pooling.

- A dynamic feature fusion approach combines local and global features together with outputs from the last residual block based on importance scores, which leads to superior classification performance in challenging cholesteatoma image analysis.

- Experimental results demonstrate superior classification performance compared to conventional methods, showcasing the model’s effectiveness in leveraging local and global information.

The remainder of this paper is structured as follows: Section 2 reviews relevant works from various perspectives, establishing the foundation for our study. Section 3 introduces our proposed model, detailing its technical implementation and the underlying theoretical principles. In Section 4, we outline the experimental framework, including the datasets used for evaluation and the performance metrics applied. The paper concludes in Section 5 with a summary of the main findings and proposes possible directions for future research in this area.

2. Related works

2.1. Local feature extraction

CNNs have been frequently utilized for their inherent ability to capture local features across various applications. For example, Tian et al.²⁶ employed CNNs within a dual network structure to extract both local and global features for image denoising. Concurrently, the effectiveness of convolutional operations for capturing local details was implicitly studied by Utomo et al.,²⁷ who compared models using explicit local feature descriptors against standard CNN performance on medical images. Subsequent research continued to integrate CNNs for their local feature extraction capabilities, often within hybrid models. For instance, Mou et al.²⁸ designated a CNN channel purely for local feature extraction in a multimodal driver distraction system, while Hu et al.²⁹ applied a CNN module to summarize local planar features in magnetic resonance images before integrating them. In other imaging areas, Li et al.³⁰ leveraged the strength of CNNs in computing local dependencies for mineral prospectivity mapping, Wang et al.³¹ used CNN features to interactively capture local structural information for image super resolution, and Xu et al.³² proposed a specific dynamic CNN branch to effectively encode local pixel features for hyperspectral image classification. MedViTV2³³ incorporated Kolmogorov Arnold Network layers into a transformer, achieving efficient global and local feature perception with improved performance.

2.2. Global feature extraction

Recognizing the limitations of CNNs in capturing long-range dependencies and global context, several approaches have sought to enhance their ability to model global features. He et al.³⁴ introduced a framework with distinct branches for local and global attention within a CNN architecture to learn comprehensive representations for depression recognition. Wang et al.³⁵ developed a multitask attention CNN with a global feature shared network designed to extract globally relevant information for tasks like fault diagnosis and working condition identification. Similarly, Li et al.³⁶ combined global information from the entire image with component details in a multiscale CNN to improve synthetic aperture radar target recognition. Further, Liu et al.³⁷ integrated a CNN focused on local features with a vision transformer for learning global features to address high-resolution synthetic aperture radar images’ challenges. Xiong et al.³⁸ designed a CNN model with separate branches for extracting and fusing global and local features from electromyography signals to enhance hand gesture recognition. To tackle the struggle with long distance dependencies, Li et al.³⁰ integrated a Transformer with a CNN for mineral prospectivity mapping. Khan et al.³⁹ introduced a model with global and local convolutional components for histopathological slide classification. Lastly, Duan et al.⁴⁰ combined transformers with CNNs for multi-focus image fusion, enhancing interaction between features using knowledge distillation.

2.3. Feature fusion strategies

Several studies have explored feature fusion techniques within CNN frameworks to enhance performance across various domains. For instance, Khan et al.⁴¹ integrated low-level handcrafted features with multi-stage deep CNN features for image scene geometry recognition through fusion strategies. Extending this concept, Ding et al.⁴² developed a multi-feature fusion network combining multi-scale graph convolutional networks and multi-scale CNNs to fuse their outputs for hyperspectral image classification. In a parallel development, Li et al.⁴³ presented a combined architecture of a CNN for spatial features and an LSTM network for temporal features, fusing these along with intermediate convolutional layer features for motor imagery EEG classification. Similarly, Ou et al.⁴⁴ designed a CNN framework incorporating feature fusion after applying band selection for hyperspectral image change detection. Xiao et al.⁴⁵ constructed a multi-scale CNN with an attention mechanism, employing feature fusion strategies to recognize the appearance of spot-welding surfaces. Further, Haq et al.⁴⁶ employed feature fusion within a deep CNN architecture, combined with ensemble learning, to improve the classification of abnormalities in mammographic images. Zhang et al.⁴⁷ developed a dual-channel CNN that enhanced feature representation by repeatedly fusing deep and shallow features for wildfire detection.

3. Methodology

In this paper, we present a classification method for cholesteatoma images that leverages a ResNet⁴⁸ architecture, integrating Gabor filtering for local feature extraction and an attention mechanism for contextual-global feature aggregation. As shown in Figure 2, the input image is first processed through residual blocks to generate feature maps. These feature maps are subsequently refined by Gabor filters, which further enhance local features. Concurrently, the attention mechanism produces an importance score map to identify relevant regions in the feature space. These scores are normalized into a weight matrix and applied using GeM pooling to create a robust global descriptor. Next, the local features, global features, and outputs from the last residual block are fused using the proposed dynamic feature fusion. Finally, this combined representation is passed through a fully connected layer to generate logits, which are then converted into class probabilities via Softmax.

Figure 2.

The input image is processed through residual blocks to create feature maps. Subsequently, Gabor-local feature extraction (GLFE) and contextual-global feature aggregation (CGFA) are applied to generate local and global features. Following this, dynamic feature fusion (DFF) is employed to combine the local and global features with the outputs from the last residual block, based on their significance scores. Finally, the fused features are passed through a fully connected layer to produce logits, which are then converted into class probabilities using the softmax function.

3.1. Gabor-local feature extraction

As illustrated in Figure 3, the Gabor-local feature extraction process begins with an input image $I \in R^{H \times W \times 1}$ , where H and W represent the height and width, respectively. At layer l of the last residual block, the output feature map $X^{[l]} \in R^{h \times w \times c}$ is computed using the formula:

X^{[l]} = F_{conv} (X^{[l - 1]}) + X^{[l - 1]},

(1)

where

F_{conv} : R^{h \times w \times c} \to R^{h \times w \times c}

represents the convolution transformation function mapping features from layer l − 1 to layer l, with h, w, and c denoting the height, width, and number of channels of the feature map, respectively. Before applying multi-scale Gabor filtering, an adaptive parameter selection mechanism is employed, in which the Gabor parameters are dynamically adjusted based on feature statistics:

λ_{i} = λ_{0} \cdot α_{λ} (X^{[l]}), σ_{i} = σ_{0} \cdot α_{σ} (X^{[l]}),

(2)

where α_λ and α_σ are adaptive scaling functions that analyze the feature map to determine appropriate parameter values, and λ₀ and σ₀ are base values. The wavelength scaling function α_λ is defined as:

α_{λ} (X^{[l]}) = clip (1 + β_{λ} \cdot std (X^{[l]}), 0.5, 2.0),

(3)

where

std (X^{[l]}) = \sqrt{\frac{1}{h w c} \sum_{i, j, k} {(X_{i, j, k}^{[l]} - μ)}^{2}}

computes the standard deviation of the feature map,

μ = \frac{1}{h w c} \sum_{i, j, k} X_{i, j, k}^{[l]}

is the mean of the feature map, and β_λ is a learnable scaling factor. Similarly, the standard deviation scaling function α_σ is defined as:

α_{σ} (X^{[l]}) = clip (1 + β_{σ} \cdot mean (| X^{[l]} |), 0.5, 2.0),

(4)

where

mean (| X^{[l]} |) = \frac{1}{h w c} \sum_{i, j, k} | X_{i, j, k}^{[l]} |

computes the mean absolute value of the feature map, and β_σ is a learnable scaling factor. Once the parameters (λ_i, σ_i) have been adaptively determined, a multi-scale Gabor filtering approach is applied. A set of Gabor filters

{G_{i}}_{i = 1}^{N}

with varying parameters is defined, where each Gabor filter

G_{i} \in R^{k \times k}

is formulated as:

G_{i} (x, y; λ_{i}, θ_{i}, ψ_{i}, σ_{i}) = \exp \left( - \frac{x'^{2} + y'^{2}}{2 σ_{i}^{2}} \right) \exp \left( \mathrm{i} \left( \frac{2 π}{λ_{i}} x' + ψ_{i} \right) \right),

(5)

where x′ = x cos θ_i + y sin θ_i and y′ = −x sin θ_i + y cos θ_i are the rotated coordinates, the upright i inside the second exponential is the imaginary unit (distinct from the filter index i), λ_i > 0 is the wavelength, θ_i ∈ [0, π] determines the orientation, ψ_i ∈ [0, 2π] is the phase offset, and σ_i > 0 is the standard deviation of the Gaussian envelope. Moreover, the Gabor filter is decomposed into real and imaginary parts:

G_{i, Re} (x, y) = \exp \left( - \frac{x'^{2} + y'^{2}}{2 σ_{i}^{2}} \right) \cdot \cos \left( \frac{2 π}{λ_{i}} x' + ψ_{i} \right),

(6)

G_{i, Im} (x, y) = \exp \left( - \frac{x'^{2} + y'^{2}}{2 σ_{i}^{2}} \right) \cdot \sin \left( \frac{2 π}{λ_{i}} x' + ψ_{i} \right),

(7)

such that convolution with the feature map yields two separate responses:

F_{i, Re}^{[l]} = G_{i, Re} * X^{[l]}, F_{i, Im}^{[l]} = G_{i, Im} * X^{[l]} .

(8)
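The adaptive parameter selection of equations (2) to (4) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `adaptive_gabor_params` and the default values for λ₀, σ₀, β_λ, and β_σ are hypothetical, and the scaling factors are fixed constants here, whereas the paper treats β_λ and β_σ as learnable.

```python
import numpy as np

def adaptive_gabor_params(x, lam0=4.0, sigma0=2.0, beta_lam=0.1, beta_sigma=0.1):
    """Return (lambda_i, sigma_i) for a feature map x of shape (h, w, c).

    lam0, sigma0, beta_lam and beta_sigma are illustrative constants, not
    values reported in the paper.
    """
    # Eq. (3): wavelength scale from the population standard deviation of x.
    alpha_lam = np.clip(1.0 + beta_lam * x.std(), 0.5, 2.0)
    # Eq. (4): envelope scale from the mean absolute activation of x.
    alpha_sigma = np.clip(1.0 + beta_sigma * np.abs(x).mean(), 0.5, 2.0)
    # Eq. (2): rescale the base values.
    return float(lam0 * alpha_lam), float(sigma0 * alpha_sigma)
```

The clipping to [0.5, 2.0] keeps each parameter within a factor of two of its base value, so a feature map with unusually large activations cannot push the wavelength or envelope width to extreme values.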

Figure 3.

The input feature X^[l−1] is first processed by a residual block to produce the feature map X^[l]. An adaptive parameter selection mechanism then analyzes the statistical characteristics of X^[l] to determine the parameters for the Gabor filters. Using these parameters, multi-scale Gabor filters G_i,Re and G_i,Im are constructed, whose convolution with X^[l] yields the multi-scale feature representation $F_{multi}^{[l]}$ . This representation is subsequently refined by a spatial attention mechanism, which emphasizes the most informative regions. The resulting attention map is applied to $F_{multi}^{[l]}$ , producing the enhanced local features $F_{local}^{[l]}$ for downstream processing.
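As a rough illustration of equations (6) to (8), the sketch below builds the real and imaginary Gabor kernels and applies one of them to a feature map. Several points are assumptions rather than details from the paper: the helper names `gabor_kernel` and `filter2d_same` are hypothetical, the kernel size is taken to be odd, zero padding is used, and each channel is filtered independently (the paper does not state how the filter is applied across channels).

```python
import numpy as np

def gabor_kernel(k, lam, theta, psi, sigma):
    """Real and imaginary k x k Gabor kernels, Eqs. (6) and (7). k is assumed odd."""
    coords = np.arange(k) - (k - 1) / 2.0
    ys, xs = np.meshgrid(coords, coords, indexing="ij")
    # Rotated coordinates x' and y'.
    x_r = xs * np.cos(theta) + ys * np.sin(theta)
    y_r = -xs * np.sin(theta) + ys * np.cos(theta)
    envelope = np.exp(-(x_r ** 2 + y_r ** 2) / (2.0 * sigma ** 2))
    phase = 2.0 * np.pi / lam * x_r + psi
    return envelope * np.cos(phase), envelope * np.sin(phase)

def filter2d_same(x, kernel):
    """Convolve every channel of x (h, w, c) with kernel, keeping the spatial size."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    flipped = kernel[::-1, ::-1]  # convolution flips the kernel
    out = np.zeros(x.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += flipped[dy, dx] * xp[dy:dy + x.shape[0], dx:dx + x.shape[1], :]
    return out

# Eq. (8): two responses per filter (parameter values are illustrative).
g_re, g_im = gabor_kernel(k=5, lam=4.0, theta=0.3, psi=0.0, sigma=2.0)
features = np.random.default_rng(0).normal(size=(8, 8, 3))
f_re = filter2d_same(features, g_re)
f_im = filter2d_same(features, g_im)
```

Keeping the real and imaginary responses separate, instead of collapsing them into a magnitude, is what allows the later concatenation step to retain phase information.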

For each Gabor filter, the concatenated real and imaginary responses are computed as:

F_{i, concat}^{[l]} = Concat (F_{i, Re}^{[l]}, F_{i, Im}^{[l]}),

(9)

where Concat(⋅, ⋅) concatenates along the channel dimension, enabling the preservation of both amplitude and phase information. Collecting responses from N multi-scale Gabor filters yields:

F_{multi}^{[l]} = Concat (F_{1, concat}^{[l]}, F_{2, concat}^{[l]}, \dots, F_{N, concat}^{[l]}),

(10)

so that

F_{multi}^{[l]} \in R^{h \times w \times (2 N \cdot c^{'})}

, where c′ denotes the number of channels per filter output. To focus on the most relevant spatial locations within the features, a spatial attention mechanism

S : R^{h \times w \times C} \to R^{h \times w \times 1}

is employed to generate a spatial attention map, where C = 2N · c′ is the number of channels of $F_{multi}^{[l]}$ . S is implemented with convolutional layers that reduce the input feature map to a single-channel map, followed by a sigmoid activation function:

M^{[l]} = S (F_{multi}^{[l]}) = sigmoid ({Conv}_{spatial} (F_{multi}^{[l]})),

(11)

where

M^{[l]} \in R^{h \times w \times 1}

is the spatial attention map. Finally, the enhanced local feature response

F_{local}^{[l]}

at layer l is obtained by applying the spatial attention map element-wise:

F_{local}^{[l]} = F_{multi}^{[l]} ⊙ M^{[l]},

(12)

where ⊙ denotes element-wise multiplication with broadcasting along channels, followed by batch normalization:

F_{local}^{[l]} = BatchNorm (F_{local}^{[l]}) .

(13)
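Equations (9) to (13) can be sketched as follows. This is a simplified illustration, not the authors' implementation: `local_features` is a hypothetical name, the spatial convolution is assumed to be a single 1 × 1 convolution (a weighted sum over channels), and batch normalization is approximated by standardizing each channel over the spatial positions of one sample, without the learnable scale and shift.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_features(responses, w_spatial, b_spatial=0.0, eps=1e-5):
    """responses: list of N (F_re, F_im) pairs, each array of shape (h, w, c')."""
    # Eq. (9): concatenate real and imaginary responses of each filter.
    per_filter = [np.concatenate([re, im], axis=-1) for re, im in responses]
    # Eq. (10): stack all filters, giving shape (h, w, 2 * N * c').
    f_multi = np.concatenate(per_filter, axis=-1)
    # Eq. (11): single-channel attention map (1 x 1 convolution assumed).
    m = sigmoid(f_multi @ w_spatial + b_spatial)[..., None]
    # Eq. (12): broadcast the map over all channels.
    f_local = f_multi * m
    # Eq. (13): per-channel standardization as a stand-in for BatchNorm.
    mean = f_local.mean(axis=(0, 1), keepdims=True)
    var = f_local.var(axis=(0, 1), keepdims=True)
    return (f_local - mean) / np.sqrt(var + eps), m
```

With N = 4 filters and c′ = 3 channels per response, for example, the attention weights `w_spatial` need 2 · 4 · 3 = 24 entries.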

3.2. Contextual-global feature aggregation

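As shown in Figure 4, contextual-global feature aggregation begins with GeM pooling applied to the feature tensor $X^{[l]} \in R^{H \times W \times C}$ , where H, W, and C here denote the height, width, and number of channels of the feature map (written h, w, and c in Section 3.1). To ensure numerical stability, the feature map is first rectified as: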

{\tilde{X}}^{[l]} = \max (X^{[l]}, ϵ),

(14)

where ϵ is a small positive constant. The GeM pooled global descriptor is then computed channel-wise as:

F_{preglobal, c}^{[l]} = {(\frac{1}{H W} \sum_{m = 1}^{H} \sum_{n = 1}^{W} {({\tilde{x}}_{c, m, n}^{[l]})}^{p})}^{\frac{1}{p}},

(15)

where c = 1, …, C, and p is a learnable pooling parameter controlling pooling sharpness. The preglobal descriptor is then used to generate context-aware attention parameters:

θ_{att} = g_{cond} (F_{preglobal}^{[l]}),

(16)

where g_cond(⋅) is a multilayer perceptron (MLP). The attention function f_att is implemented as a 1 × 1 convolution whose parameters are determined by θ_att:

A_{regional} = f_{att} (X^{[l]}; θ_{att}),

(17)

where A_regional denotes the raw regional attention score map. The scores are normalized by spatial softmax:

w''_{m, n} = \frac{\exp (a''_{m, n})}{\sum_{m' = 1}^{H} \sum_{n' = 1}^{W} \exp (a''_{m', n'})},

(18)

where

a''_{m, n}

denotes the entry of A_regional at position (m, n). The regional feature is obtained by weighted aggregation:

{(F_{regional}^{[l]})}_{c} = \sum_{m = 1}^{H} \sum_{n = 1}^{W} w''_{m, n} \cdot x_{c, m, n}^{[l]}, \quad c = 1, \dots, C .

(19)

Finally, the global descriptor is refined by a residual connection:

F_{global}^{[l]} = F_{preglobal}^{[l]} + F_{regional}^{[l]} .

(20)
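The aggregation in equations (14) to (20) can be sketched as below. The sketch makes several assumptions that go beyond the text: `gem_pool` and `contextual_global` are hypothetical names, g_cond is taken to be a two-layer MLP with a ReLU in between, its output is used directly as the C weights of a bias-free 1 × 1 convolution, and p is a fixed constant rather than a learned parameter.

```python
import numpy as np

def gem_pool(x, p=3.0, eps=1e-6):
    """GeM pooling of x (H, W, C) to a vector of length C, Eqs. (14)-(15)."""
    x = np.maximum(x, eps)                       # Eq. (14): rectification
    return np.mean(x ** p, axis=(0, 1)) ** (1.0 / p)

def contextual_global(x, w1, w2, p=3.0):
    """w1: (C, hidden) and w2: (hidden, C) are the assumed MLP weights."""
    f_pre = gem_pool(x, p)                       # Eq. (15)
    theta_att = np.maximum(f_pre @ w1, 0.0) @ w2 # Eq. (16): context-aware parameters
    scores = x @ theta_att                       # Eq. (17): 1 x 1 conv, shape (H, W)
    scores = scores - scores.max()               # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()            # Eq. (18)
    f_regional = np.tensordot(weights, x, axes=([0, 1], [0, 1]))  # Eq. (19)
    return f_pre + f_regional, weights           # Eq. (20)
```

Because the attention weights are produced from the pooled descriptor of the same image, two images with different overall content get different 1 × 1 filters, which is what makes the attention context-dependent.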

Figure 4.

The contextual-global feature aggregation first applies GeM pooling to X^[l] to obtain an initial global descriptor $F_{preglobal}^{[l]}$ , which captures the overall scene context. Conditioned on this descriptor, a dual-MLP network generates attention parameters θ_att, which guide a 1 × 1 convolution followed by softmax normalization to produce regional attention weights W_regional. These weights are then applied to X^[l] via weighted summation, yielding a contextually refined regional feature F_{regional_sum}. Finally, a residual addition integrates the initial global descriptor with this refined representation, producing the output $F_{global}^{[l]}$ .

3.3. Dynamic feature fusion

To further enhance model performance, we develop a dynamic feature fusion approach that integrates local features, global features, and the output from the last residual block. Initially, these features are extracted from the preceding layers of ResNet. Subsequently, they are flattened into three distinct representations: $F_{local}^{[l]} \in R^{d_{1}}$ , $F_{global}^{[l]} \in R^{d_{2}}$ , and $F^{[last]} \in R^{d_{3}}$ . Here, d₁, d₂, and d₃ denote the dimensions of the respective feature vectors, each of which is projected to a common dimension via a linear layer before fusion. During the fusion phase, a scoring function $\tilde{S} (\cdot)$ computes the importance score for each feature. It builds on an intermediate function f(X) that applies a linear transformation followed by a non-linear activation:

f (X) = ReLU (W_{f} X + b_{f})

(21)

where W_f is the weight matrix and b_f is the bias term for the transformation. The scoring function

\tilde{S} (X)

is then defined as:

\tilde{S} (X) = W_{s} \cdot f (X) + b_{s}

(22)

where W_s is a weight matrix and b_s is a bias term, transforming the processed features into a scalar score. Subsequently, importance scores for each of the three features are calculated:

w_{local} = \tilde{S} (F_{local}^{[l]}), \quad w_{global} = \tilde{S} (F_{global}^{[l]}), \quad w_{last} = \tilde{S} (F^{[last]})

(23)

These scores are then normalized with a softmax so that the resulting weights sum to 1:

w_{total} = e^{w_{local}} + e^{w_{global}} + e^{w_{last}}

(24)

and

w_{item}^{'} = \frac{e^{w_{item}}}{w_{total}}, \quad item \in \{local, global, last\}

(25)

Next, the normalized weights are combined with the three features to yield a composite feature:

F^{[combined]} = w_{local}^{'} F_{local}^{[l]} + w_{global}^{'} F_{global}^{[l]} + w_{last}^{'} F^{[last]}

(26)
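The fusion of equations (21) to (26) amounts to a softmax over three scalar scores followed by a weighted sum. The sketch below assumes the three feature vectors have already been projected to a common dimension and that one scoring function is shared by all three; the names `importance_score` and `dynamic_fusion` are hypothetical.

```python
import numpy as np

def importance_score(x, w_f, b_f, w_s, b_s):
    """Scalar importance score of one feature vector, Eqs. (21)-(22)."""
    hidden = np.maximum(w_f @ x + b_f, 0.0)   # Eq. (21): linear layer + ReLU
    return float(w_s @ hidden + b_s)          # Eq. (22): project to a scalar

def dynamic_fusion(features, w_f, b_f, w_s, b_s):
    """features: dict mapping a name to a vector of the common dimension."""
    names = list(features)
    raw = np.array([importance_score(features[n], w_f, b_f, w_s, b_s)
                    for n in names])          # Eq. (23)
    e = np.exp(raw - raw.max())               # shifted for numerical stability
    weights = e / e.sum()                     # Eqs. (24)-(25)
    fused = sum(w * features[n] for w, n in zip(weights, names))  # Eq. (26)
    return fused, dict(zip(names, weights))
```

If all three scores are equal, the weights reduce to 1/3 each and the fusion becomes plain averaging, so the static averaging baseline in Table 3 can be read as a special case of this scheme.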

Afterwards, a fully connected layer maps this fused feature into output space. The logits can be expressed as follows, where W_l denotes the linear transformation matrix and b represents the bias term:

Z = W_{l}^{T} F^{[combined]} + b

(27)

Finally, these logits are passed through a Softmax function to obtain class probability distributions:

P (y = \tilde{c} | X) = \frac{\exp (Z_{\tilde{c}})}{\sum_{j' = 0}^{1} \exp (Z_{j'})}, \quad \tilde{c} \in \{0, 1\}

(28)

During training, the cross-entropy loss function evaluates the discrepancy between predicted results and true labels, where y represents the true label:

L = - (y \log (P (y = 1)) + (1 - y) \log (P (y = 0)))

(29)
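The classification head and loss of equations (27) to (29) can be written compactly as below. The function names are hypothetical, and the small constant added inside the logarithm is a common numerical safeguard that the paper does not mention.

```python
import numpy as np

def predict_proba(f_combined, w_l, b):
    """Class probabilities for one fused feature vector; w_l has shape (d, 2)."""
    z = w_l.T @ f_combined + b    # Eq. (27): logits
    z = z - z.max()               # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()            # Eq. (28): softmax over the two classes

def cross_entropy(prob, y, eps=1e-12):
    """Binary cross-entropy of Eq. (29) for a label y in {0, 1}."""
    return float(-(y * np.log(prob[1] + eps) + (1 - y) * np.log(prob[0] + eps)))
```

An uninformative classifier that assigns probability 0.5 to both classes incurs a loss of log 2 for either label, which gives a convenient reference point when reading training curves.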

4. Experiment

4.1. Datasets

Our dataset consists of 562 images obtained through CT scanning using a 128-channel multidetector SOMATOM Definition Edge CT scanner (Siemens Inc., Munich, Germany). Each scan is acquired with consistent imaging parameters, including an axial section thickness of 0.625 mm, collimation of 128 × 0.6 mm, a field of view of 220 × 220 mm, a pitch of 0.8, and a matrix size of 224 × 224. Imaging is conducted at 120 kV and 240 mAs to maintain uniform image quality throughout the dataset. Clinical labels for each ear derive from definitive diagnoses recorded in patients’ medical records. The dataset maintains a balanced representation of cholesteatoma and normal cases. To guard against data leakage between training, validation, and test sets, we apply stratified 5-fold cross-validation at the patient level, so that no patient contributes images to more than one fold. Finally, model selection relies on the average performance across validation folds.
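A patient-level stratified split can be sketched with the standard library alone. The helper `patient_level_folds` is hypothetical and simplifies the setting described above: it assumes one class label per patient, whereas the paper's labels are assigned per ear, so a real split would need a rule for patients whose two ears differ.

```python
import random
from collections import defaultdict

def patient_level_folds(patient_labels, n_folds=5, seed=0):
    """Assign patients to folds so that class proportions stay similar.

    patient_labels: dict mapping patient id -> class label (one per patient,
    which is a simplifying assumption). Returns a list of n_folds id lists.
    """
    by_class = defaultdict(list)
    for pid, label in sorted(patient_labels.items()):
        by_class[label].append(pid)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        pids = by_class[label]
        rng.shuffle(pids)
        # Deal the patients of this class round-robin across the folds.
        for i, pid in enumerate(pids):
            folds[i % n_folds].append(pid)
    return folds
```

All images of a patient then follow that patient's fold, so images from the same person never appear on both sides of a train and validation split.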

4.2. Implementation details

The proposed classification network is developed and evaluated using PyTorch 1.9.0 on an NVIDIA GeForce RTX 3090. Prior to training, all input CT scans are resampled to a uniform voxel size of 1 mm × 1 mm × 1 mm to ensure consistent spatial resolution across subjects. Subsequently, intensity normalization is performed to achieve zero mean and unit variance throughout the dataset, standardizing the overall distribution of voxel values. To enhance model generalization and mitigate overfitting, several data augmentation strategies are applied, including random rotations, horizontal flips, and brightness adjustments. The network is optimized using the Adam algorithm with an initial learning rate of 0.001 and a weight decay of 1e-5, which adds L2 regularization to the parameter updates. A learning rate scheduler reduces the learning rate by a factor of 0.1 every 15 epochs over a total of 100 training epochs. To avoid unnecessary training and further reduce overfitting, an early stopping strategy monitors validation performance and halts training when no improvement is observed over five consecutive epochs.
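The step schedule and early stopping rule described above can be expressed without any framework. The names `step_lr` and `EarlyStopping` are hypothetical, and the sketch assumes that the monitored quantity is a validation metric for which higher is better; the paper states only that validation performance is monitored.

```python
def step_lr(epoch, base_lr=1e-3, gamma=0.1, step_size=15):
    """Learning rate at a given epoch: multiplied by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_metric):
        """Record one validation result; return True when training should stop."""
        if val_metric > self.best:
            self.best = val_metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Under this schedule the learning rate is 0.001 for epochs 0 to 14, 0.0001 for epochs 15 to 29, and so on, although early stopping with a patience of five will often end training well before epoch 100.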

4.3. Evaluation metrics

Accuracy quantifies the frequency with which a model correctly classifies instances. It is defined as the ratio of correctly predicted instances to the total number of instances:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

(30)

Here, TP represents true positives, TN indicates true negatives, FP denotes false positives, and FN refers to false negatives.

Sensitivity measures the proportion of actual positive instances correctly identified by the model. It is calculated using the following formula:

Sensitivity = \frac{TP}{TP + FN}

(31)

Specificity assesses how effectively a model identifies negative instances, defined mathematically as:

Specificity = \frac{TN}{TN + FP}

(32)

The Area Under the ROC Curve (AUC) serves as a performance measure for classification models across various threshold settings. It quantifies how well a classifier distinguishes between different classes.
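For concreteness, the four metrics can be computed as in the sketch below. The function names are hypothetical. The AUC is computed through its rank interpretation, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half; this is equivalent to the area under the ROC curve.

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity, Eqs. (30)-(32), for labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def auc_score(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size))
```

Unlike the first three metrics, the AUC is computed from the continuous scores rather than from thresholded predictions, which is why it can rank models differently from accuracy.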

4.4. Validation of Gabor-local feature extraction

The experimental evaluation assesses the effectiveness of Gabor-local feature extraction (GLFE) by comparing different models. Specifically, the analysis contrasts models incorporating GLFE with those relying solely on conventional convolutional layers (Conv) or operating without (w/o) GLFE. As shown in Table 1, integrating GLFE improves every metric; for example, accuracy rises from 87.58% without GLFE to 90.85%, and AUC rises from 0.8704 to 0.9021. We attribute this gain to GLFE’s ability to extract discriminative local features, which matters for the complex and nuanced patterns present in cholesteatoma images. The results therefore indicate that models enhanced with GLFE separate the two classes better than their conventional counterparts.

Table 1.

Validation of Gabor-local feature extraction.

Method	Accuracy (%)	Sensitivity (%)	Specificity (%)	AUC
Conv	87.36	80.64	89.34	0.8721
w/o GLFE	87.58	85.18	88.10	0.8704
w/GLFE	90.85	89.65	91.12	0.9021

4.5. Effectiveness of contextual-global feature aggregation

This experiment evaluates the specific contribution of Contextual-Global Feature Aggregation (CGFA) by comparing model performance with and without this module. The results in Table 2 show a clear gain when CGFA is incorporated: accuracy rises from 87.59% to 90.85%, sensitivity from 81.48% to 89.65%, specificity from 88.89% to 91.12%, and AUC from 0.8595 to 0.9021. We attribute this enhancement to CGFA’s use of attention scores, which prioritize relevant features while suppressing irrelevant ones. By dynamically adjusting feature weights based on calculated relevance, CGFA generates more robust global descriptors. Overall, these findings support the effectiveness of CGFA for this classification task.

Table 2.

Effectiveness of contextual-global feature aggregation.

Method	Accuracy (%)	Sensitivity (%)	Specificity (%)	AUC
w/o CGFA	87.59	81.48	88.89	0.8595
w/CGFA	90.85	89.65	91.12	0.9021

4.6. Impact of dynamic feature fusion

This subsection investigates the effectiveness of Dynamic Feature Fusion (DFF) in integrating local, global, and residual block features within the model architecture. To isolate the specific contribution of DFF, we compare its performance against simpler, static fusion techniques such as concatenation, summation, and averaging, tested across various configurations. The experimental results, presented in Table 3, reveal that DFF’s approach, which dynamically balances the influence of these diverse feature sources, yields superior model performance compared to the static methods. This outcome highlights the importance of adaptive feature fusion strategies like DFF, demonstrating their capability to create more discriminative representations and thereby improve overall classification performance.

Table 3.

Influence of dynamic feature fusion.

Method	Accuracy (%)	Sensitivity (%)	Specificity (%)	AUC
Concatenation	88.78	85.71	88.06	0.8657
Summation	87.65	86.67	87.80	0.8599
Averaging	88.24	83.87	89.34	0.8691
w/o DFF	88.12	81.48	89.88	0.8551
w/DFF	90.85	89.65	91.12	0.9021

4.7. Influence of pooling methods

This experiment evaluates the impact of different pooling strategies on global feature aggregation within the image classification model, specifically comparing Average Pooling, Max Pooling, and GeM Pooling. The empirical results, documented in Table 4, clearly demonstrate that GeM Pooling achieves superior performance compared to both Average and Max Pooling. This suggests that GeM Pooling’s mechanism, which generalizes the pooling operation, is more effective at capturing and retaining discriminative global features essential for classification. Ultimately, the use of GeM Pooling contributes to a more robust feature representation, leading to enhanced model performance.
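The sense in which GeM pooling generalizes the other two strategies can be checked numerically: with p = 1 it reduces to average pooling, and as p grows it approaches max pooling. The sketch below uses a hypothetical helper `gem` and arbitrary illustrative activation values.

```python
import numpy as np

def gem(values, p, eps=1e-6):
    """Generalized-mean of a set of non-negative activations."""
    v = np.maximum(np.asarray(values, dtype=float), eps)
    return float(np.mean(v ** p) ** (1.0 / p))

activations = [0.1, 0.2, 0.3, 2.0]   # illustrative values, not data from the paper
gem_p1 = gem(activations, 1.0)       # identical to the arithmetic mean
gem_p3 = gem(activations, 3.0)       # pulled toward the largest activation
gem_p50 = gem(activations, 50.0)     # close to the maximum
```

A learnable p therefore lets the network settle between the two extremes compared in Table 4 instead of committing to either one in advance.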

Table 4.

Influence of pooling methods.

Method	Accuracy (%)	Sensitivity (%)	Specificity (%)	AUC
Average Pooling	89.14	80.64	90.16	0.8562
Max Pooling	88.79	82.14	89.60	0.8543
GeM Pooling	90.85	89.65	91.12	0.9021

4.8. Effect of different residual network depths

This experiment investigates the relationship between the depth of residual blocks in the ResNet architecture and its performance on classifying normal middle ear and cholesteatoma images. Initially, as depicted in Figure 5, increasing network depth leads to improved performance. This gain is attributable to the enhanced feature extraction capabilities of deeper models, which allow them to capture more intricate patterns crucial for distinguishing between the two medical conditions. However, this positive trend reverses beyond a certain threshold (specifically, a depth of 50). At greater depths, the model becomes susceptible to overfitting by learning noise and specifics from the training data rather than generalizable patterns, which consequently diminishes its performance on unseen data.

Figure 5.

Effect of different residual network depths.

4.9. Effectiveness of various input sizes

As shown in Figure 6, the experimental results demonstrate variations in performance across different input image sizes. Through detailed analysis, we observe that larger input dimensions, particularly 112 × 112 and 224 × 224 pixels, generally yield superior classification performance compared to smaller input sizes, albeit at the cost of increased computational overhead. However, our experiments reveal an interesting phenomenon: the network shows diminishing returns in performance improvements beyond 224 × 224, suggesting this resolution represents an optimal balance point between performance and computational efficiency. Based on these findings, we ultimately adopt the size of 224 × 224 as it offers a compelling compromise, maintaining relatively high performance while effectively reducing computational requirements compared to larger input dimensions.

Figure 6.

Effectiveness of various input sizes.

4.10. Visualization of attention maps

As shown in Figure 7, the attention map visualizations indicate that the proposed model concentrates on key pathological regions within cholesteatoma images. This suggests that the importance score mechanism directs the model’s focus toward diagnostically relevant areas. By highlighting these regions, the model can integrate local and global features more accurately, which strengthens its overall representation ability and is consistent with the observed improvement in classification performance.

Figure 7.

Visualizations of attention maps.
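To make the role of the importance scores concrete, the following is a minimal sketch, not the exact implementation used in the network, of how a score map could be normalized into weights for GeM pooling and how branch descriptors could then be fused. It assumes a softmax normalization over spatial locations, non-negative features, an exponent p = 3, equally sized branch descriptors, and one scalar importance score per branch. The function names `weighted_gem` and `dynamic_fusion` and all numeric values are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_gem(features, scores, p=3.0, eps=1e-6):
    """Attention-weighted generalized-mean pooling (illustrative assumption).

    features: one non-negative feature vector per spatial location.
    scores:   one raw importance score per spatial location.
    """
    weights = softmax(scores)  # normalize scores into a weight per location
    channels = len(features[0])
    pooled = []
    for c in range(channels):
        total = sum(w * max(f[c], eps) ** p for w, f in zip(weights, features))
        pooled.append(total ** (1.0 / p))
    return pooled


def dynamic_fusion(branches, importance):
    """Weighted sum of equally sized descriptors using softmax-normalized importance."""
    weights = softmax(importance)
    dim = len(branches[0])
    return [sum(w * b[i] for w, b in zip(weights, branches)) for i in range(dim)]


# Toy inputs (hypothetical values, for illustration only).
local_feat = [1.0, 2.0]
global_feat = weighted_gem([[1.0, 2.0], [3.0, 4.0]], scores=[0.0, 0.0], p=1.0)
deep_feat = [3.0, 4.0]
fused = dynamic_fusion([local_feat, global_feat, deep_feat], importance=[0.0, 0.0, 0.0])
```

With equal scores and p = 1 the pooling reduces to a plain average, while a strongly peaked score map makes the descriptor follow the highest-scoring location, which is the behavior the attention maps are meant to visualize.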

4.11. Comparison with state-of-the-art models

In this section, we compare the proposed method with state-of-the-art approaches. The comparison methods are Vision Transformer,²⁰ DenseNet,¹⁸ ConvNeXt,²² SE-ResNet,¹⁹ BoTNet,²¹ iFormer,²³ VAN,²⁵ TransNeXt,²⁴ MobileNetV2,¹⁴ Inception V3 + SVM,¹³ and MedViTV2s.³³ As illustrated in Table 5, our method achieves the highest value on every reported metric, although the margins over the strongest baselines are small (for example, an AUC of 0.9021 versus 0.9020 for MedViTV2s). This performance can be attributed to the Gabor-local feature extraction, which enables the model to capture essential local representations. In addition, the contextual-global feature aggregation and dynamic feature fusion enhance the model’s ability to integrate broader contextual information, which contributes to its overall performance.

Table 5.

Comparison with state-of-the-art models.

Method	Accuracy (%)	Sensitivity (%)	Specificity (%)	AUC
VIT²⁰	88.98	81.48	89.77	0.8862
DenseNet¹⁸	89.32	88.89	89.68	0.8589
ConvNeXt²²	89.87	83.87	90.98	0.8906
SE-ResNet¹⁹	90.35	85.18	89.75	0.8810
BoTNet²¹	89.75	86.67	88.62	0.8471
iFormer²³	89.67	85.18	88.86	0.8939
VAN²⁵	88.89	81.48	90.48	0.8645
TransNeXt²⁴	90.20	88.92	90.48	0.9015
MobileNetV2¹⁴	90.03	89.51	90.56	0.8911
Inception V3 + SVM¹³	89.86	88.81	90.91	0.8910
MedViTV2s³³	90.21	89.51	90.91	0.9020
Ours	90.85	89.65	91.12	0.9021
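For reference, the threshold-based metrics in Table 5 can be computed from confusion-matrix counts. The sketch below assumes the standard definitions with cholesteatoma as the positive class; the function name `binary_metrics` and the counts are hypothetical and are not taken from the dataset used here. AUC is omitted because it requires the ranked prediction scores rather than counts.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity, and specificity as percentages.

    tp/fn: cholesteatoma cases predicted correctly / missed.
    tn/fp: normal cases predicted correctly / flagged as cholesteatoma.
    """
    total = tp + fn + tn + fp
    accuracy = 100.0 * (tp + tn) / total
    sensitivity = 100.0 * tp / (tp + fn)  # true positive rate
    specificity = 100.0 * tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity


# Hypothetical counts, chosen only for illustration.
acc, sens, spec = binary_metrics(tp=45, fn=5, tn=90, fp=10)
print(f"accuracy={acc:.2f}%, sensitivity={sens:.2f}%, specificity={spec:.2f}%")
```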

5. Discussion and conclusion

In this paper, the proposed model outperforms the compared state-of-the-art methods in classifying cholesteatoma versus normal middle ear images, which supports the efficacy of its integrated design. Specifically, incorporating Gabor filtering alongside the ResNet architecture improves local feature extraction, enabling the model to capture textural and structural patterns that are critical for identifying cholesteatoma. The contextual-global feature aggregation generates an importance score map, which lets the model focus dynamically on the most salient image regions and produces a robust global feature descriptor via GeM pooling. The dynamic feature fusion strategy then combines Gabor-local features, contextual-global features, and deep semantic features from the final residual block according to their computed importance, supporting a more comprehensive multi-scale analysis.

Clinically, the model holds promise for improving diagnostic performance, which matters because cholesteatoma requires timely intervention whereas a normal middle ear does not. Such improvement could help reduce misdiagnosis, support appropriate and timely patient management, and potentially decrease morbidity, avoid unnecessary surgery, and optimize the use of healthcare resources in otolaryngology.

This study has limitations. It was conducted at a single center with a retrospective design, which may introduce selection bias and limits the generalizability of the results. The dataset consists exclusively of CT images and does not include MRI data, which provide complementary diagnostic information. To address these limitations, we plan a prospective multicenter study to validate the model in a larger and more diverse patient population, incorporating multimodal imaging data, including MRI, for comprehensive evaluation.

Footnotes

ORCID iD

Guokai Zhang

Author contributions

Jianqing Chen contributed to investigation, data curation, conceptualization, writing of the original draft, and project administration. Guokai Zhang was responsible for methodology and formal analysis. Yanran Wang handled visualization and contributed to review and editing. Yuqing Sun supervised the project, validated the results, and contributed to review and editing.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was sponsored by the Interdisciplinary Program of Shanghai Jiao Tong University (YG2019QNB12).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Wei, Zhou, Zheng, et al. Congenital cholesteatoma clinical and surgical management. International Journal of Pediatric Otorhinolaryngology 2023; 164: 111401. https://doi.org/10.1016/j.ijporl.2022.111401

2. Pachpande, Singh. Diagnosis and treatment modalities of cholesteatomas: a review. Cureus 2022; 14(11): e31153. https://doi.org/10.7759/cureus.31153

3. Wang Y-M, Cheng Y-S, et al. Deep learning in automated region proposal and diagnosis of chronic otitis media based on computed tomography. Ear and Hearing 2020; 41(3): 669–677. https://doi.org/10.1097/AUD.0000000000000794

4. Byun, et al. An assistive role of a machine learning network in diagnosis of middle ear diseases. Journal of Clinical Medicine 2021; 10(15): 3198. https://doi.org/10.3390/jcm10153198

5. Eroğlu, Yıldırım, et al. Is it useful to use computerized tomography image-based artificial intelligence modelling in the differential diagnosis of chronic otitis media with and without cholesteatoma? American Journal of Otolaryngology 2022; 43(3): 103395. https://doi.org/10.1016/j.amjoto.2022.103395

6. Duan, Pan L-L, Chen W-X, et al. An in-depth discussion of cholesteatoma, middle ear inflammation, and Langerhans cell histiocytosis of the temporal bone, based on diagnostic results. Frontiers in Pediatrics 2022; 10: 809523. https://doi.org/10.3389/fped.2022.809523

7. Wang, Song, et al. Structure-aware deep learning for chronic middle ear disease. Expert Systems with Applications 2022; 194: 116519. https://doi.org/10.1016/j.eswa.2022.116519

8. Miwa, Minoda, Yamaguchi, et al. Application of artificial intelligence using a convolutional neural network for detecting cholesteatoma in endoscopic enhanced images. Auris Nasus Larynx 2022; 49(1): 11–17. https://doi.org/10.1016/j.anl.2021.03.018

9. Tseng, Lim, Jyung. Use of artificial intelligence for the diagnosis of cholesteatoma. Laryngoscope Investigative Otolaryngology 2023; 8(1): 201–211. https://doi.org/10.1002/lio2.1008

10. Takahashi, Noda, Yoshida, et al. Preoperative prediction by artificial intelligence for mastoid extension in pars flaccida cholesteatoma using temporal bone high-resolution computed tomography: A retrospective study. PLoS One 2022; 17(10): e0273915. https://doi.org/10.1371/journal.pone.0273915

11. Cao, Song, et al. Structure-constrained deep feature fusion for chronic otitis media and cholesteatoma identification. Multimedia Tools and Applications 2023; 82(29): 45869–45889. https://doi.org/10.1007/s11042-023-15425-7

12. Chen, Sun, et al. A 3D and explainable artificial intelligence model for evaluation of chronic otitis media based on temporal bone computed tomography: model development, validation, and clinical application. Journal of Medical Internet Research 2024; 26: e51706. https://doi.org/10.2196/51706

13. Ouattassi, Maaroufi, Slaoui, et al. Middle ear-acquired cholesteatoma diagnosis based on CT scan image mining using supervised machine learning models. Beni-Suef University Journal of Basic and Applied Sciences 2024; 13(1): 78. https://doi.org/10.1186/s43088-024-00534-5

14. Ayral, Türk, Can, et al. How advantageous is it to use computed tomography image-based artificial intelligence modelling in the differential diagnosis of chronic otitis media with and without cholesteatoma? European Review for Medical & Pharmacological Sciences 2023; 27(1): 215–223. https://doi.org/10.26355/eurrev_202301_30874

15. Eroğlu, Yıldırım, et al. Comparison of computed tomography-based artificial intelligence modeling and magnetic resonance imaging in diagnosis of cholesteatoma. The Journal of International Advanced Otology 2023; 19(4): 342–349. https://doi.org/10.5152/iao.2023.221004

16. Emilio, De Luca, Max, et al. How reliable is artificial intelligence in the diagnosis of cholesteatoma on CT images? American Journal of Otolaryngology 2024; 46: 104519. https://doi.org/10.1016/j.amjoto.2024.104519

17. Aktar Uğurlu, Numan Uğurlu. Exploring trends and developments in cholesteatoma research: a bibliometric analysis. European Archives of Oto-Rhino-Laryngology 2024; 281(10): 5199–5210. https://doi.org/10.1007/s00405-024-08749-z

18. Huang, Liu, Maaten LVD, et al. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.

19. Shen, Sun. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.

20. Dosovitskiy, Beyer, Kolesnikov, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 2020.

21. Srinivas, Lin T-Y, Parmar, et al. Bottleneck transformers for visual recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 16519–16529.

22. Liu, Mao, C-Y, et al. A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11976–11986.

23. Zhou, et al. Inception transformer. Advances in Neural Information Processing Systems 2022; 35: 23495–23509.

24. Dai. TransNeXt: Robust foveal visual perception for vision transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 17773–17783.

25. Guo M-H, C-Z, Liu Z-N, et al. Visual attention network. Computational Visual Media 2023; 9(4): 733–752. https://doi.org/10.1007/s41095-023-0364-2

26. Tian, Zuo, et al. Designing and training of a dual CNN for image denoising. Knowledge-Based Systems 2021; 226: 106949. https://doi.org/10.1016/j.knosys.2021.106949

27. Utomo, Juniawan, Vincent, et al. Local features based deep learning for mammographic image classification: in comparison to CNN models. Procedia Computer Science 2021; 179: 169–176. https://doi.org/10.1016/j.procs.2020.12.022

28. Mou, Chang, Zhou, et al. Multimodal driver distraction detection using dual-channel network of CNN and transformer. Expert Systems with Applications 2023; 234: 121066. https://doi.org/10.1016/j.eswa.2023.121066

29. Wang, et al. Conv-swinformer: Integration of CNN and shift window attention for Alzheimer’s disease classification. Computers in Biology and Medicine 2023; 164: 107304. https://doi.org/10.1016/j.compbiomed.2023.107304

30. Cheng, Xiao, Sun, et al. CNN-transformers for mineral prospectivity mapping in the Maodeng–Baiyinchagan area, southern Great Xing’an Range. Ore Geology Reviews 2024; 167: 106007.

31. Wang, Zou, Alfarraj, et al. Image super-resolution method based on the interactive fusion of transformer and CNN features. The Visual Computer 2024; 40(8): 5827–5839. https://doi.org/10.1007/s00371-023-03138-9

32. Mei, Zhang, et al. Bridging CNN and transformer with cross attention fusion network for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 2024; 62: 1–14.

33. Nejati Manzari, Asgariandehkordi, Koleilat, et al. Medical image classification with KAN-integrated transformers and dilated neighborhood attention. arXiv preprint arXiv:2502.13693 2025.

34. Chan JC-W, Wang. Automatic depression recognition using CNN with attention mechanism from videos. Neurocomputing 2021; 422: 165–175. https://doi.org/10.1016/j.neucom.2020.10.015

35. Wang, Liu, Peng, et al. Feature-level attention-guided multitask CNN for fault diagnosis and working conditions identification of rolling bearing. IEEE Transactions on Neural Networks and Learning Systems 2021; 33(9): 4757–4769. https://doi.org/10.1109/TNNLS.2021.3060494

36. Wei. Multiscale CNN based on component analysis for SAR ATR. IEEE Transactions on Geoscience and Remote Sensing 2021; 60: 1–12.

37. Liu, Liang, et al. High resolution SAR image classification using global-local network structure based on vision transformer and CNN. IEEE Geoscience and Remote Sensing Letters 2022; 19: 1–5. https://doi.org/10.1109/lgrs.2022.3151353

38. Xiong, Chen, Niu, et al. A global and local feature fused CNN architecture for the sEMG-based hand gesture recognition. Computers in Biology and Medicine 2023; 166: 107497. https://doi.org/10.1016/j.compbiomed.2023.107497

39. Zhao, Asif, Chen, et al. GLNet: global–local CNN’s-based informed model for detection of breast cancer categories from histopathological slides. The Journal of Supercomputing 2024; 80(6): 7316–7348. https://doi.org/10.1007/s11227-023-05742-x

40. Duan, Luo, Zhang. Combining transformers with CNN for multi-focus image fusion. Expert Systems with Applications 2024; 235: 121156. https://doi.org/10.1016/j.eswa.2023.121156

41. Khan, Chefranov, Demirel. Image scene geometry recognition using low-level features fusion at multi-layer deep CNN. Neurocomputing 2021; 440: 111–126. https://doi.org/10.1016/j.neucom.2021.01.085

42. Ding, Zhang, Zhao, et al. Multi-feature fusion: Graph neural network and CNN combining for hyperspectral image classification. Neurocomputing 2022; 501: 246–257. https://doi.org/10.1016/j.neucom.2022.06.031

43. Ding, Zhang, et al. Motor imagery EEG classification algorithm based on CNN-LSTM feature fusion network. Biomedical Signal Processing and Control 2022; 72: 103342. https://doi.org/10.1016/j.bspc.2021.103342

44. Liu, et al. A CNN framework with slow-fast band selection and feature fusion grouping for hyperspectral image change detection. IEEE Transactions on Geoscience and Remote Sensing 2022; 60: 1–16. https://doi.org/10.1109/tgrs.2022.3156041

45. Xiao, Yang, Wang, et al. A feature fusion enhanced multiscale CNN with attention mechanism for spot-welding surface appearance recognition. Computers in Industry 2022; 135: 103583. https://doi.org/10.1016/j.compind.2021.103583

46. Ul Haq, Ali, Wang, et al. Feature fusion and ensemble learning-based CNN model for mammographic image classification. Journal of King Saud University-Computer and Information Sciences 2022; 34(6): 3310–3318. https://doi.org/10.1016/j.jksuci.2022.03.023

47. Zhang, Guo, Chen, et al. Wildfire detection via a dual-channel CNN with multi-level feature fusion. Forests 2023; 14(7): 1499. https://doi.org/10.3390/f14071499

48. Zhang, Ren, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.