Abstract
Auditory brainstem response (ABR) interpretation in clinical practice often relies on visual inspection by audiologists, which is prone to inter-practitioner variability. While deep learning (DL) algorithms have shown promise in objectifying ABR detection in controlled settings, their applicability to real-world clinical data is hindered by small datasets and insufficient heterogeneity. This study evaluates the generalizability of nine DL models for ABR detection using large, multicenter datasets. The primary dataset analyzed, Clinical Dataset I, comprises 128,123 labeled ABRs from 13,813 participants across a wide range of ages and hearing levels, and was divided into a training set (90%) and a held-out test set (10%). The models included convolutional neural networks (CNNs; AlexNet, VGG, ResNet), transformer-based architectures (Transformer, Patch Time Series Transformer [PatchTST], Differential Transformer, and Differential PatchTST), and hybrid CNN-transformer models (ResTransformer, ResPatchTST). Performance was assessed on the held-out test set and four external datasets (Clinical II, Southampton, PhysioNet, Mendeley) using accuracy and area under the receiver operating characteristic curve (AUC). ResPatchTST achieved the highest performance on the held-out test set (accuracy: 91.90%, AUC: 0.976). Transformer-based models, particularly PatchTST, showed superior generalization to external datasets, maintaining robust accuracy across diverse clinical settings. Additional experiments highlighted the critical role of dataset size and diversity in enhancing model robustness. We also observed that incorporating acquisition parameters and demographic features as auxiliary inputs yielded performance gains in cross-center generalization. These findings underscore the potential of DL models—especially transformer-based architectures—for accurate and generalizable ABR detection, and highlight the necessity of large, diverse datasets in developing clinically reliable systems.
Keywords
Introduction
The auditory brainstem response (ABR) is the early component of the auditory evoked potentials, commonly occurring within the first 10 ms after stimulus onset (Ballachanda et al., 1992; Jewett & Williston, 1971). A typical ABR consists of seven vertex-positive waves labeled I–VII, of which waves I–V are usually examined, with wave V being the most prominent (Boettcher, 2002; Tsutsui et al., 1986). The ABR threshold, usually defined as the lowest stimulus intensity capable of evoking wave V, is clinically useful for objective assessment of hearing ability (Dobrowolski et al., 2016; Ren et al., 2016; Valderrama et al., 2012), particularly for patients whose behavioral test results are unreliable. Determining whether an ABR, here specifically wave V, is present at various intensities is an essential step in this application. However, this process often requires visual inspection and subjective judgment by human experts (Lightfoot et al., 2019), which is time-consuming, costly, and highly variable (Vidler & Parker, 2004; Wimalarathna et al., 2022). Disagreements arise even among experienced audiologists, and the same expert may judge the same ABR data differently across repeated trials (Pratt et al., 1995; Stueve & O'Rourke, 2003; Zaitoun et al., 2014), particularly as the stimulus intensity is decreased toward threshold. For these reasons, a computer-aided diagnosis method for objective and automated ABR detection is desirable to assist human experts.
Previous attempts to automate ABR detection for estimating hearing thresholds have primarily included statistical methods and artificial intelligence (AI) techniques. Statistical methods, such as Fsp (Chesnaye et al., 2018; Elberling & Don, 1984), Fmp (Chesnaye et al., 2018; McKearney, 2023), q-sample uniform scores test (Chesnaye et al., 2018; Stürzebecher et al., 1999), Hotelling's T2 test (Chesnaye et al., 2018, 2019), cross-covariance analysis (Suthakar & Liberman, 2019; Tanaka et al., 2023), Pearson product-moment correlation (Arnold, 1985; Wang et al., 2021; Weber & Fletcher, 1980), cross-correlation against a template (Davey et al., 2003, 2007; Elberling, 1979) or interleaved responses (Berninger et al., 2014), have been used to model relationships between variables and make predictions based on statistical assumptions. In contrast, AI techniques appear to exhibit stronger predictive power by learning complex relationships and temporal patterns within data in a data-driven manner, which are beneficial for ABR detection.
Machine learning (ML) and deep learning (DL) algorithms are subsets of AI. Several studies have leveraged ML to detect the ABR. With 285 ABR recordings from 10 ears available, Alpsan (1991) proposed a three-layered artificial neural network (ANN) to detect the presence or absence of ABR, achieving an accuracy of 74.9%. Acır et al. (2006) trained and evaluated a support vector machine (SVM) classifier on 648 ABR recordings from 36 adult normal-hearing ears, with the highest accuracy of 97.7% attained using discrete cosine transform coefficients as input features. Davey et al. (2007) proposed hybrid classification models combining ANN and C5.0 decision tree algorithms for automated ABR detection, using time, frequency, and cross-correlation measures from 550 ABR recordings of 85 subjects. They reported 95.6% accuracy for strong responses and 85% for weak responses. Based on the same dataset, McCullagh et al. (2007) extracted features in both the time and wavelet domains, achieving 83.4% accuracy with a Naïve Bayes classifier. These studies demonstrate that when combined with proper feature engineering methods, ML algorithms can achieve promising performance in ABR detection. However, they heavily rely on manual feature engineering, which requires considerable expertise.
Recently, the application of DL has enabled ABR detection without the need for manual feature extraction from waveforms, as demonstrated in a few studies. McKearney and MacKinnon (2019) introduced a one-dimensional convolutional neural network (CNN) to classify ABRs into three classes: "response present," "response absent," and "inconclusive." The network was trained and tested on 232 paired ABR waveforms from 8 normal-hearing individuals, achieving an overall accuracy of 92.9%. McKearney et al. (2022) trained four types of ML and DL models to detect the presence or absence of the ABR on simulated data generated from ABRs recorded from 12 normal-hearing participants and no-stimulus EEG data from 15 participants. Their stacked ensemble model, which combined CNN-long short-term memory (CNN-LSTM) and random forest models, achieved an area under the receiver operating characteristic (ROC) curve of approximately 0.975. Liang et al. (2024) proposed the Wide & Deep model, which integrated a five-layer multilayer perceptron (MLP) using time and frequency features along with demographic factors, and a CNN-BiLSTM-Attention model using original and denoised signal sequences as inputs. Trained and evaluated on 2,556 ABR waveforms from 100 participants, this model achieved an accuracy of 91.0%. However, the inclusion of pure-tone thresholds may have led to an overestimation of the model's true performance.
Despite these promising results, DL models have not yet been implemented in clinical practice, partly due to concerns about their generalizability. First, the scarcity of ABR data, with datasets typically comprising fewer than 100 participants, hinders the application of DL models, which are data-hungry and require large samples for strong generalizability. Second, previous DL models have predominantly been trained on experimental datasets comprising young adults with normal hearing. This lack of diversity in training data can lead to poor generalizability, especially when applied to heterogeneous populations in real-world clinical settings (e.g., infants, children, and the elderly with and without hearing loss). It is important for training data to encompass the populations in which the model will be used, including individuals of varying ages, sexes, and hearing statuses. For example, there are known latency differences in the ABR between infants and adults (Moore et al., 1995) and age-related amplitude changes (Grose et al., 2019). Sex-related differences have also been observed, with males typically exhibiting longer latencies and smaller amplitudes than females (Aloufi et al., 2023; Dehan & Jerger, 1990). Additionally, age-related hearing loss is associated with decreased ABR peak amplitudes (Frisina et al., 2016). Lastly, there is a lack of external validation, since existing models have been trained and tested on single cohorts. Moreover, none of the previous studies has comprehensively compared various DL algorithms on a large body of ABR data, a valuable step toward recommending suitable models for clinical ABR detection.
To address these issues, the primary goal of this study is to develop generalizable DL models to detect the presence of wave V in the ABR using a large cohort of real-world clinical data. We evaluate and compare multiple DL models for detecting the ABR across multicenter datasets. The primary clinical dataset analyzed includes a total of 128,123 responses from 13,813 participants of varying ages and hearing statuses, ensuring better generalizability to clinical settings. Additionally, the trained models are externally validated on an independent clinical dataset and three publicly available datasets. The secondary goal is to systematically investigate the effect of training data size and diversity on the model's generalization performance. To achieve this, we train our model on subsets of varying sizes and compare its performance with that of the full training set. Furthermore, the model is trained on datasets with restricted age and hearing status groups and then validated on unseen groups, emphasizing the importance of diverse data for training generalizable DL models.
Materials and Methods
Datasets
In this study, we utilized a large clinically recorded dataset, referred to as “Clinical Dataset I,” for our main analysis. Additionally, an independent clinical dataset from a different hospital, referred to as “Clinical Dataset II,” and three publicly available experimental datasets, namely the Southampton Dataset, PhysioNet Dataset, and Mendeley Dataset, were used for external validation. An overview of the properties of these datasets is shown in Table 1.
Description of Datasets.
− indicates information not reported in that study. CED = Cambridge electronic design; ABR = auditory brainstem response.

Participant demographics and data distribution for Clinical Datasets I and II. (a), (b) Age group distribution by gender.
ABR testing was conducted in a sound-treated chamber, with participants lying comfortably on a bed, using the Eclipse evoked potential test system (Interacoustics Inc., Denmark) as part of routine clinical audiological assessments. A 100
The click intensities ranged from 100, 90, or 80 dBnHL (dB scale relative to normal hearing level) to the participant's ABR threshold in steps of 10 or 5 dB. Figure 2 provides four examples of ABR data recorded at a range of stimulus intensities. Labeling followed the criteria described by Sutton et al. (2013) and involved two stages. In the first stage (between January 2, 2018 and April 15, 2023), one of six experienced audiologists labeled the waveform immediately after ABR testing at each intensity, then reviewed the threshold based on the series of waveforms. Each waveform was labeled as either ‘‘response present (RP)’’ or ‘‘response absent (RA)’’ based on the presence or absence of peak V. In the second stage (between April 17, 2023 and December 28, 2023), two audiologists with over five years of experience independently inspected the series of waveforms and checked the labels. In cases of disagreement, a third expert was consulted to reach a consensus. In this dataset, 65% of ABR waveforms were labeled as RP and the remaining 35% as RA. These labels served as the ground truth for model training and testing.

Examples of auditory brainstem response (ABR). Four examples of ABR at a range of stimulus levels for hearing threshold estimation. Wave V is labeled where assessed as present.
Data Partition
Figure 3 illustrates the workflow of DL model development and evaluation. Clinical Dataset I was used for our main analysis and was split into 90% for model development (115,409 ABRs from 12,256 individuals) and 10% for testing (12,714 ABRs from 1,557 individuals). The development data were further divided into nine folds, with eight used for training and one for internal validation during hyper-parameter tuning. The remaining 10%, referred to as the “held-out test set,” was reserved for model testing, as it was wholly separate from model training. There was no overlap of subjects between subsets, so that performance reflected how well the DL models generalized to unseen ABR data.
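For illustration, the subject-level partitioning described above can be sketched as follows, assuming a pandas DataFrame with a hypothetical subject_id column; this is a minimal example of grouped splitting, not the authors' code.

```python
# Minimal sketch of a subject-level 90/10 split followed by nine
# subject-grouped folds, assuming a pandas DataFrame `df` with a
# hypothetical "subject_id" column.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, GroupKFold

def split_by_subject(df: pd.DataFrame, seed: int = 888):
    # 90% development / 10% held-out test, grouped by subject so that
    # no participant appears in both subsets.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.10, random_state=seed)
    dev_idx, test_idx = next(splitter.split(df, groups=df["subject_id"]))
    dev, test = df.iloc[dev_idx], df.iloc[test_idx]

    # Nine subject-grouped folds of the development set for hyper-parameter
    # tuning (eight folds train, one fold validates in each rotation).
    folds = list(GroupKFold(n_splits=9).split(dev, groups=dev["subject_id"]))
    return dev, test, folds
```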

Overview of the proposed deep learning framework for detecting auditory brainstem responses (ABRs). (a) Data preparation: Clinical Dataset I is split into 90% for training and 10% as the held-out test set, while Clinical Dataset II, along with the Southampton, PhysioNet, and Mendeley datasets, are used for external validation. (b) Model development: The training set is used for model development, including hyperparameter tuning using 9-fold cross-validation, followed by retraining with the optimized hyperparameters. (c) Model evaluation: The final models classify ABRs as response present or response absent, with performance assessed on five test sets to evaluate generalization across multicenter datasets.
Additionally, multicenter validation was conducted using Clinical Dataset II, Southampton Dataset, PhysioNet Dataset, and Mendeley Dataset as independent test sets, ensuring robust evaluation across diverse datasets and demonstrating the models’ generalizability in a multicenter context.
Data Preprocessing
First, the original ABR waveforms in Clinical Dataset I were down-sampled to 15 kHz, compressing each waveform to 180 data points for optimized memory usage and processing efficiency. Normalization was essential to ensure generalization across datasets with varying magnitude scales. Specifically, the waveform with the maximum amplitude in the set from one ear was transformed to the −1 to 1 range, while all the other waveforms in the same set were scaled proportionally. To align with Clinical Dataset I, a window of 0–12 ms was applied to all waveforms from Clinical Dataset II, Southampton, PhysioNet, and Mendeley Datasets, which were then re-sampled to 15 kHz and normalized using the same procedure. In this way, each ear had its own normalization factor rather than a global one, preserving relative amplitude information within each ear while mitigating interear variability. This process ensured that all datasets, despite their variations in amplitude, were represented on a uniform scale, facilitating effective model generalization across multicenter datasets.
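A minimal sketch of this preprocessing, assuming each ear's recordings arrive as a 2-D NumPy array covering the 0–12 ms window, is shown below; function and variable names are illustrative rather than the authors' implementation.

```python
# Preprocessing sketch: re-sample each waveform to 180 points
# (12 ms window at 15 kHz) and normalize per ear.
import numpy as np
from scipy.signal import resample

TARGET_POINTS = 180  # 12 ms * 15 kHz = 180 samples per waveform

def preprocess_ear(waveforms: np.ndarray) -> np.ndarray:
    """waveforms: array of shape (n_waveforms, n_samples) from one ear."""
    # Re-sample every waveform in the set to 180 data points.
    resampled = resample(waveforms, TARGET_POINTS, axis=1)

    # One normalization factor per ear: the largest absolute amplitude in
    # the set maps that waveform to the -1 to 1 range, and all other
    # waveforms from the same ear are scaled proportionally, preserving
    # relative amplitude information across stimulus intensities.
    scale = np.max(np.abs(resampled))
    return resampled / scale if scale > 0 else resampled
```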
Model Architectures
This study developed nine DL models for ABR detection, utilizing a variety of architectures. Three CNN-based models—AlexNet, VGG, and residual neural network (ResNet)—were designed to extract local ABR features. Additionally, four Transformer-based models were implemented: Transformer, Patch Time Series Transformer (PatchTST), Differential Transformer (DiffTransformer), and Differential PatchTST (DiffPatchTST), which leverage self-attention mechanisms to capture temporal dependencies in the ABR waveforms. To combine the strengths of CNNs and attention mechanisms, two hybrid models were introduced: ResTransformer (ResNet + Transformer) and ResPatchTST (ResNet + PatchTST). All models were trained using scaled ABR waveforms, represented as one-dimensional time series with 180 data points. The architectures of these models are illustrated in Figure 4.
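As a rough illustration of the PatchTST-style pipeline (patch embedding, Transformer encoder, pooled classification head), a minimal PyTorch sketch is given below; the patch length, stride, and model dimensions are illustrative choices, not the tuned hyperparameters reported in this study.

```python
# Minimal PatchTST-style classifier sketch for a 180-point ABR waveform.
import torch
import torch.nn as nn

class PatchTSTClassifier(nn.Module):
    def __init__(self, seq_len=180, patch_len=20, stride=10,
                 d_model=64, n_heads=4, n_layers=3, n_classes=2):
        super().__init__()
        self.patch_len, self.stride = patch_len, stride
        n_patches = (seq_len - patch_len) // stride + 1
        self.embed = nn.Linear(patch_len, d_model)           # patch embedding
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)             # RP vs. RA

    def forward(self, x):                  # x: (batch, 180) scaled waveform
        patches = x.unfold(1, self.patch_len, self.stride)    # (batch, n_patches, patch_len)
        z = self.encoder(self.embed(patches) + self.pos)
        return self.head(z.mean(dim=1))    # pool over patches, then classify
```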

Architectures of various DL models. CNN-based models: AlexNet, VGG, and ResNet; Transformer-based models: Transformer, PatchTST, DiffTransformer and DiffPatchTST; Hybrid models: ResTransformer and ResPatchTST. Note. DL = deep learning; CNN = convolutional neural network; PatchTST = patch time series transformer; DiffTransformer = differential transformer; DiffPatchTST = differential PatchTST.
Model Training and Evaluation
To address the class imbalance in the training dataset, the model was trained using a weighted cross-entropy loss, in which each class was assigned a weight inversely proportional to its sample size, ensuring equal contribution during training. We employed a combination of grid and random search to optimize the model's hyper-parameters with 9-fold cross-validation on the training set. Once the optimal hyperparameters were determined, the model was retrained on the full training set to produce the final model. Its performance was then evaluated on the held-out test set from Clinical Dataset I to assess generalization to unseen ABR data. In addition, the trained model was externally validated on the four independent datasets (Clinical Dataset II, Southampton Dataset, PhysioNet Dataset, and Mendeley Dataset) to test its cross-center generalizability. All nine models underwent the same training and evaluation process. Each training session used 20 epochs and a batch size of 512. The Adam optimizer was used to adjust the learning rate, with initial values tuned for each model (see optimal hyperparameters in Table 1 in SDC 2). A random seed of 888 was set for reproducibility of the data split. All experiments were executed on a Linux machine with an NVIDIA RTX-4090 Ti GPU, using Python (version 3.11.5) and the PyTorch framework (version 2.1.1).
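A minimal sketch of the inverse-frequency class weighting used in the weighted cross-entropy loss is shown below; variable and function names are illustrative.

```python
# Class weights inversely proportional to sample counts, so the minority
# "response absent" class contributes equally to the loss.
import torch
import torch.nn as nn

def make_weighted_loss(labels: torch.Tensor) -> nn.CrossEntropyLoss:
    counts = torch.bincount(labels)                       # samples per class
    weights = counts.sum() / (len(counts) * counts.float())
    return nn.CrossEntropyLoss(weight=weights)

# Usage (illustrative):
#   criterion = make_weighted_loss(train_labels)
#   loss = criterion(model(batch_waveforms), batch_labels)
```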
F1-score provides a harmonic mean of precision and sensitivity, and is calculated by:

F1 = (2 × Precision × Sensitivity) / (Precision + Sensitivity)
ROC curves are plots of sensitivity versus 1 − specificity. The area under the ROC curve (AUC) was also calculated to evaluate test performance. To compute 95% confidence intervals (CIs), the bootstrap resampling method with 1,000 bootstrap repetitions was used. Pairwise DeLong's tests (DeLong et al., 1988) were conducted to assess the statistical significance of AUC differences between models. Data were analyzed between April 16, 2023, and January 3, 2024.
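The bootstrap estimate of the AUC confidence interval can be sketched as follows, assuming NumPy arrays of true labels and predicted scores; DeLong's test is not part of this sketch, and the study's own implementation may differ.

```python
# AUC with a 1,000-repetition bootstrap 95% confidence interval.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_with_bootstrap_ci(y_true, y_score, n_boot=1000, seed=888):
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    rng = np.random.default_rng(seed)
    auc = roc_auc_score(y_true, y_score)

    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        if len(np.unique(y_true[idx])) < 2:               # AUC needs both classes
            continue
        boot.append(roc_auc_score(y_true[idx], y_score[idx]))

    lo, hi = np.percentile(boot, [2.5, 97.5])
    return auc, (lo, hi)
```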
Results
Performance Comparison of Models on Primary Clinical Dataset I
The performance of various DL models, including AlexNet, VGG, ResNet, Transformer, PatchTST, DiffTransformer, DiffPatchTST, ResTransformer, and ResPatchTST, was evaluated on Clinical Dataset I through both internal validation and hold-out validation. A summary of the results is provided in Figure 5.

Performance of deep learning models on Clinical Dataset I. (a) Error bar plots for accuracy, sensitivity, specificity, F1-score, and area under the receiver operating characteristic curve (AUC) (left to right), evaluated using 9-fold cross-validation on the training set. (b) Forest plots for the same metrics evaluated on the held-out test set. Bold values indicate the best performance for each metric.
For internal validation (i.e., 9-fold cross-validation on the training set), ResPatchTST achieved the highest accuracy at 91.07%.
As shown in Figure 5(b), when evaluated on the held-out test set from Clinical Dataset I, ResPatchTST demonstrated the best performance across multiple metrics, achieving 91.90% accuracy (95% CI: 91.43%–92.40%), an AUC of 0.976 (95% CI: 0.973–0.978), and an F1-score of 0.919 (0.914–0.923). It also maintained balanced specificity (92.93%) and sensitivity (90.89%). ResTransformer, VGG, and ResNet, although slightly behind in accuracy, demonstrated strong generalization with AUCs above 0.974. These models maintained competitive F1-scores and showed robustness across both internal and hold-out validations on Clinical Dataset I.
To assess the statistical significance of performance differences among models, we conducted pairwise DeLong's tests on their AUCs. Significant differences (p < .05) were observed between most model pairs, except for ResNet versus VGG (p = .686), AlexNet versus DiffTransformer (p = .286), AlexNet versus Transformer (p = .231), PatchTST versus ResPatchTST (p = .450), and ResPatchTST versus ResTransformer (p = .350). Notably, many of these non-significant pairs share similar architectural structures. For example, ResNet and VGG are both CNN-based models. PatchTST and ResPatchTST share a common structural foundation in the PatchTST design; while ResPatchTST and ResTransformer are hybrid models that combine ResNet components with Transformer-based architectures.
To further validate the clinical utility of our models for automated ABR threshold determination, we converted waveform-level predictions into ear-level ABR threshold estimates using a rule-based aggregation strategy. Specifically, to address inconsistent outputs at the same intensity level and false positives at lower intensities, the threshold was defined as the lowest intensity at which the majority of waveforms were classified as ABR-present. Figure 6 presents the distribution of individual prediction errors for ResPatchTST. When applied to the held-out test set, 2,588 out of 3,052 ears (84.80%) were correctly predicted, and 2,912 out of 3,052 ears (95.41%) were predicted within 10 dB of the expert-labeled thresholds.
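One plausible implementation of this aggregation rule is sketched below; the data structure and function name are hypothetical, and the handling of edge cases in the study may differ.

```python
# Sketch of the rule-based aggregation: predictions from one ear are grouped
# by stimulus intensity, a majority vote resolves inconsistent outputs at the
# same level, and the threshold is the lowest intensity whose majority vote
# is "response present".
from collections import defaultdict

def estimate_threshold(predictions):
    """predictions: iterable of (intensity_dBnHL, is_response_present)."""
    votes = defaultdict(list)
    for intensity, present in predictions:
        votes[intensity].append(present)

    present_levels = [
        level for level, outcomes in votes.items()
        if sum(outcomes) > len(outcomes) / 2     # majority classified as present
    ]
    return min(present_levels) if present_levels else None  # None: no response found
```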

Distribution of absolute errors in auditory brainstem response (ABR) threshold prediction and corresponding cumulative accuracy for ResPatchTST. Each bar represents the number of ears at a given absolute prediction error level (in dB).
Performance Comparison of Models on Multicenter Datasets
Figure 7 summarizes the accuracy of various trained models evaluated on multicenter datasets: Clinical Dataset I, Clinical Dataset II, Southampton Dataset, PhysioNet Dataset, and Mendeley Dataset. Notably, the results for Clinical Dataset I were computed on a held-out test set that was completely independent of the training process.

Forest plots for accuracies of various models externally evaluated on multicenter datasets: Clinical Dataset I, Clinical Dataset II, Southampton, PhysioNet, and Mendeley Datasets. Bold values indicate the best result for each dataset.
CNN-based models, including AlexNet, VGG, and ResNet, achieved strong performance on Clinical Dataset I (89.60%–91.47%) and Southampton Dataset (90.28%–95.83%), and performed adequately on Clinical Dataset II (87.84%–89.69%). However, their accuracies declined significantly on the Mendeley Dataset (80.00%–85.26%). Performance on the PhysioNet Dataset varied, with accuracies ranging from 87.50% to 93.75%.
Transformer-based models (e.g., Transformer, PatchTST, DiffTransformer, and DiffPatchTST) demonstrated superior generalization to external datasets despite not achieving the highest accuracy on Clinical Dataset I. Among these, PatchTST exhibited the most robust performance, with accuracies of 90.55% on Clinical Dataset I, 88.90% on Clinical Dataset II, 95.83% on Southampton Dataset, 92.31% on PhysioNet, and 89.47% on Mendeley Dataset. Transformer performed worse than PatchTST on Clinical Datasets I and II and the Southampton Dataset, but outperformed PatchTST on PhysioNet and achieved comparable results on the Mendeley Dataset. DiffTransformer excelled on the Mendeley Dataset, achieving the highest accuracy (90.53%) among all models.
Hybrid models such as ResTransformer and ResPatchTST, which combine CNN and Transformer architectures, showed improved performance on Clinical Datasets I and II compared with their individual components but struggled to generalize to the Mendeley Dataset. Specifically, ResPatchTST achieved the highest accuracy on Clinical Dataset I (91.90%) and demonstrated strong generalization to Clinical Dataset II (90.34%), the Southampton Dataset (94.44%), and the PhysioNet Dataset (92.31%). However, its performance on the Mendeley Dataset was notably lower (81.05%), highlighting limitations in handling highly heterogeneous datasets. ResTransformer exhibited consistently lower performance than ResPatchTST across all datasets. Overall, the results emphasize the importance of both data diversity and architectural adaptability in achieving robust generalization across diverse clinical datasets.
Generalization Analysis on Data Size and Diversity
This section presents experiments analyzing the impact of training data volume and diversity on DL models’ generalization performance. PatchTST was selected for these experiments due to its robust generalization capabilities across multicenter datasets.
Figure 8 illustrates how accuracy and AUC change as the training set grows. Both metrics improved rapidly across all three testing methods as the training set size increased, converging beyond a knee point (typically around 8,000 individuals). Beyond this threshold, gains in accuracy and AUC slowed. These results indicate that larger training datasets enhance the model's performance, although the rate of improvement diminishes once a certain size is reached.

Impact of training dataset size on the generalization performance of PatchTST. The model is retrained on subsets randomly sampled from the full training dataset (12,256 individuals) and validated using 9-fold cross-validation on the training set, as well as on the held-out test set from Clinical Dataset I and the Independent Clinical Dataset II. (a) Accuracy; (b) AUC. Note. AUC = area under the receiver operating characteristic curve; PatchTST = patch time series transformer.
As shown in Figure 9, models trained on age-restricted data consistently exhibited reduced generalization to other age groups compared to their mixed-age counterparts. The infant-trained model achieved an overall accuracy of 83.54% and an AUC of 0.915 across unseen age groups, notably lower than the 87.34% accuracy and 0.954 AUC obtained with a mixed-age subset of equal size (p < .001 for the AUC values). Although it performed reasonably well on children (accuracy: 88.17%, AUC: 0.952), its performance dropped substantially on adults (accuracy: 78.83%, AUC: 0.866) and the elderly (accuracy: 73.97%, AUC: 0.823) compared to the mixed-age model, with all AUC differences being statistically significant (p < .001). The child-trained model generalized well to infants (accuracy: 92.46%, AUC: 0.978) and performed adequately on adults (accuracy: 85.95%, AUC: 0.940) and elderly groups (accuracy: 83.89%, AUC: 0.931), achieving an overall accuracy of 87.13% and AUC of 0.949—still consistently lower than the corresponding mixed-age model (p < .001), except for the comparison with the infant group (p = .283). The adult-trained model showed relatively stable generalization across age groups (accuracy: 86.13%–89.01%, AUC: 0.946–0.960) and even slightly outperformed the mixed-age model on elderly data. Nevertheless, its overall accuracy (88.53%) and AUC (0.955) remained lower than the mixed-age baseline (accuracy: 90.37%, AUC: 0.970, p < .001). In contrast, the elderly-trained model exhibited limited generalization, particularly to infants (accuracy: 80.27%; AUC: 0.908), with an overall accuracy of 83.90% and AUC of 0.918, much lower than the 87.64% accuracy and 0.952 AUC obtained with the mixed-age model (p < .001). These findings highlight the importance of diverse, age-varied datasets for training generalizable DL models.

Generalization performance of PatchTST trained on age-restricted groups compared to mixed-age groups of equal size. The upper panel illustrates the model's performance when trained on data from a single age group and validated on unseen age groups both individually and collectively. The lower panel shows the corresponding performance metrics when the model is trained on a mixed-age dataset of equivalent size. (a) Accuracy; (b) AUC. In both panels, n represents the number of subjects. Statistical significance of AUC differences was evaluated using DeLong's test: *p < .05, **p < .01, ***p < .001. Note. AUC = area under the receiver operating characteristic curve; PatchTST = patch time series transformer.
As shown in Figure 10, models trained on restricted hearing groups exhibited a decline in generalization when evaluated on unseen hearing categories. The NH-trained model achieved 84.98% accuracy and an AUC of 0.934 on HL samples—both lower than those of the matched mixed-hearing model (accuracy: 86.60%, AUC: 0.947), with the AUC difference being statistically significant (p < .001). A similar trend was observed in the reverse setting: the model trained on HL data yielded 89.08% accuracy and 0.958 AUC on NH samples, slightly below the mixed-hearing model (accuracy: 89.22%, AUC: 0.961; p < .05). Overall, the mixed-hearing models consistently outperformed those trained on hearing-status-restricted groups, suggesting that incorporating a wider range of hearing profiles—particularly those with greater waveform variability as seen in HL patients—helps the model better generalize across hearing populations.

Generalization performance of PatchTST trained on hearing-status-restricted groups versus mixed-hearing-status groups of equal size. The upper panel shows performance when trained on either individuals with NH or HL and validated on the opposite group. The lower panel presents performance when trained on a mixed-hearing-status dataset of the same size. (a) Accuracy; (b) AUC. Statistical significance of AUC differences was evaluated using DeLong's test: *p < .05, **p < .01, ***p < .001. Note. AUC = area under the receiver operating characteristic curve; PatchTST = patch time series transformer; NH = normal hearing; HL = hearing loss.
Generalization Analysis on Acquisition Parameters and Patient Demographics
Table 2 presents the performance of the PatchTST model trained on raw ABR signals alone and in combination with acquisition parameters (e.g., stimulus rate, stimulus intensity, and epochs) or patient demographic features (e.g., age and gender), evaluated on Clinical Datasets I and II. To integrate these additional variables, the deep features extracted by PatchTST from the ABR waveforms were first compressed via a pooling layer and then concatenated with the auxiliary variables before being passed to the final classifier. Pairwise DeLong's tests were applied to compare the AUCs of models with and without the additional variables.
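A minimal sketch of this fusion step is shown below, assuming a backbone that returns patch-level features of shape (batch, patches, d_model); the dimensions and interface are illustrative rather than the exact architecture used in the study.

```python
# Fusion sketch: pooled deep features are concatenated with auxiliary
# variables (e.g., age, stimulus intensity, epochs) before classification.
import torch
import torch.nn as nn

class PatchTSTWithAuxiliary(nn.Module):
    def __init__(self, backbone: nn.Module, d_model=64, n_aux=3, n_classes=2):
        super().__init__()
        self.backbone = backbone               # assumed to return (batch, patches, d_model)
        self.pool = nn.AdaptiveAvgPool1d(1)    # compress the patch dimension
        self.classifier = nn.Linear(d_model + n_aux, n_classes)

    def forward(self, waveform, aux):
        z = self.backbone(waveform)                           # (batch, patches, d_model)
        pooled = self.pool(z.transpose(1, 2)).squeeze(-1)     # (batch, d_model)
        return self.classifier(torch.cat([pooled, aux], dim=1))
```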
Performance of the PatchTST Model Trained on Raw ABR Signals Alone and in Combination with Acquisition Parameters or Patient Demographic Factors, Evaluated on Clinical Datasets I and II.
Note. Results are reported as mean (95% confidence interval) for accuracy (%) and AUC. Arrows indicate performance change relative to the baseline model trained on raw ABR signals alone (↑ improved, ↓ decreased, → unchanged). Statistical significance of AUC differences was evaluated using DeLong's test: *p < .05, **p < .01, ***p < .001. ABR = auditory brainstem response; AUC = area under the receiver operating characteristic curve.
The baseline model, trained solely on raw ABR signals, achieved an accuracy of 90.23% and an AUC of 0.967 on Clinical Dataset I, and 87.87% accuracy with 0.964 AUC on the independent Clinical Dataset II. Adding individual variables such as age or epochs led to moderate performance improvements, with significant AUC gains on Clinical Dataset I (p < .001 for age, p < .01 for epochs) and, for age, also on Clinical Dataset II (p < .01). Among all single-variable settings, stimulus intensity produced the greatest improvement on Clinical Dataset I, achieving 91.24% accuracy and 0.974 AUC (p < .001). In contrast, gender and stimulus rate did not result in consistent improvements across datasets: gender yielded a marginal but significant AUC increase on Clinical Dataset I (p < .05), and stimulus rate showed no improvement. Feature combinations further boosted model performance. The integration of stimulus intensity and epochs yielded substantial gains in test accuracy, with the AUC on Clinical Dataset I significantly surpassing the baseline (p < .001). The best overall performance was achieved by combining age, epochs, and stimulus intensity, reaching 90.99% accuracy and 0.972 AUC on Clinical Dataset I (p < .001), and 90.48% accuracy and 0.966 AUC on Clinical Dataset II (p < .05). These performance gains, especially those observed on Clinical Dataset II, indicate that incorporating acquisition parameters and patient demographic variables effectively reduces cross-center distribution shifts, thereby enhancing the robustness and generalizability of DL-based ABR interpretation systems in clinical settings.
Discussion
While DL algorithms have proven useful in objectifying ABR detection in controlled settings (Liang et al., 2024; McKearney & MacKinnon, 2019; McKearney et al., 2022), their ability to generalize to real-world clinical data remains uncertain due to limitations such as small dataset sizes, insufficient data heterogeneity, and a lack of external validation. This study addresses these challenges by conducting the first multicenter validation of DL methods for ABR detection, utilizing a large and diverse clinical dataset comprising 13,813 participants across various age groups and degrees of hearing loss. In addition, our evaluation spans multiple datasets from different centers, each exhibiting considerable data heterogeneity due to variations in recording factors such as equipment, electrode placement, stimulus types, sampling rates, filtering ranges, and intensity levels, as well as individual characteristics like age, sex, hearing status, and differences in expert labeling (see details in Table 1). These variations across datasets pose challenges for model generalization (Eggermont et al., 1996; Hall, 2007; Zakaria et al., 2019) but also allow for a robust evaluation of the models’ generalizability in real clinical settings, as they reflect the inherent variability found in clinical practice.
The performance comparison of different DL models reveals their strengths and limitations in ABR detection. Hold-out validation on Clinical Dataset I, independent of the training process, provides an unbiased assessment of model performance. Our findings demonstrate the effectiveness of several DL models in detecting ABR (e.g., VGG, ResNet, PatchTST, DiffPatchTST), with hybrid models such as ResPatchTST and ResTransformer showing the strongest overall performance. These hybrid models combine the strengths of CNN for capturing local temporal information with Transformer architectures for modeling long-term temporal dependencies. Notably, ResPatchTST achieves the highest accuracy (91.90%), AUC (0.976), specificity (92.93%), sensitivity (90.89%), and F1-score (0.919), indicating its robustness across multiple metrics. Other models, such as ResNet, VGG, and ResTransformer, also perform competitively, maintaining accuracies above 91.43% and AUCs exceeding 0.974, with balanced F1-scores and specificity-sensitivity metrics.
Although the held-out test set was not used for model training, its similar distribution to the training set limits its capacity to fully reflect the challenges of data heterogeneity encountered in real clinical settings. Multicenter validation reveals significant performance variability across datasets, primarily driven by differences in demographics, equipment, and ABR acquisition parameters. CNN-based models achieve high accuracies on Clinical Dataset I (over 89.60%) and Southampton Dataset (90.28–95.83%), with the latter exceeding the interobserver agreement rates among audiologists (93.1–94.4%; see Table I-2 in SDC). However, their performance declines substantially on the Mendeley Dataset (80–85.26%), likely due to higher data variability, where inconsistent expert labeling is also observed, as reflected in the broader interobserver agreement range (87.37–98.95%; see Table III-2 in SDC). In contrast, Transformer-based models show superior generalization to external datasets. Among these, PatchTST maintains reasonable accuracies across all datasets without substantial drops, achieving 95.83% accuracy on the Southampton Dataset, surpassing expert labeling agreements. Notably, DiffTransformer attains 90.53% accuracy on the challenging Mendeley Dataset, exceeding or matching agreement rates in 7 out of 15 audiologist pairs (detailed in Table III-2 in SDC). This suggests that attention mechanisms in transformers may be better suited for capturing complex temporal dependencies and handling cross-center ABR signal variations. Hybrid models (e.g., ResPatchTST) consistently outperform others on Clinical Dataset I and generalize well to Clinical Dataset II, Southampton, and PhysioNet Datasets. However, their performance declines significantly on Mendeley, indicating potential limitations in handling small, highly heterogeneous datasets. This emphasizes the need for incorporating greater size and diversity in the training dataset to enhance the generalizability of hybrid models.
Another important finding from cross-center validation is that our models trained on click (broadband stimuli)-evoked ABRs demonstrated reasonable generalization to frequency-specific ABR data evoked by 1 and 4 kHz tone pips in the PhysioNet Dataset, achieving accuracies of at least 93.75% with the DiffTransformer, AlexNet, and Transformer models. This suggests that the models may capture universal features of ABR waveforms that are applicable across different stimulus types. While the current results are promising, further validation on datasets with diverse stimuli (e.g., tone bursts, chirps) is necessary to fully assess the models' generalizability in frequency-specific diagnostic applications.
In addition to selecting an appropriate DL architecture for ABR detection, the model's generalization depends on the size and diversity of the training datasets. Our experiments highlight the critical role of both factors in enhancing the performance and robustness of DL models. Larger training datasets generally improve performance on unseen data, with the most significant gains observed when training on smaller datasets. However, the rate of improvement diminishes once a certain threshold is reached. These findings provide a valuable recommendation for researchers: Sufficient training data should be included to unlock the full potential of DL models for ABR detection. Additionally, training on datasets with diverse age groups and hearing statuses significantly enhances generalization, as models trained on mixed datasets outperform those trained on age- or hearing-status-restricted data. These emphasize the importance of including diverse training data that reflects the variety of conditions likely to be encountered in clinical practice. Together, these findings underscore that large and diverse training datasets are essential for developing DL models capable of handling real-world variability, offering valuable insights for building generalizable DL models for clinical applications.
Summary of ABR Detection Studies for the Purpose of Threshold Estimation.
Note. “–” indicates information not reported in that study. ABR = auditory brainstem response; ML = machine learning; DL = deep learning; DCT = discrete cosine transform; DWT = discrete wavelet transform; ANN = artificial neural network; SVM = support vector machine; AUC = area under the receiver operating characteristic curve; NB = naive Bayes; CNN-LSTM = convolutional long short-term memory; RF = random forest; BN = batch normalization.
The clinical relevance of our proposed model is further supported by its strong performance in ABR threshold estimation. As shown in Figure 6, ResPatchTST demonstrates high consistency with expert-labeled thresholds, with over 95% of predictions falling within a 10 dB error margin. Notably, ABR threshold determination in clinical practice inherently involves a degree of ambiguity—some definitions allow for thresholds to be set at intensities yielding “inconclusive responses” (McKearney & MacKinnon, 2019). In this context, the model's ability to maintain such a high level of agreement underscores its robustness at near-threshold levels and demonstrates its potential for reliable clinical deployment.
By automating this process, the models can help standardize performance among experts with varying experience levels and reduce workloads for clinicians handling high volumes of ABR recordings, thereby facilitating better estimation of hearing thresholds across diverse clinical settings. Moreover, these models hold significant potential for reducing healthcare disparities in remote and underserved areas, where access to audiological expertise may be limited or unavailable.
Limitations
This study has two primary limitations. First, the proposed DL models detect the ABR based on single waveforms, offering potential for real-time detection as discussed previously. However, this approach neglects the sequential relationships between waveforms, contextual information that becomes especially important for accurate detection at lower stimulus intensities. Incorporating interwaveform correlations would likely improve detection performance but would necessitate predefined testing intensities, increasing the complexity and duration of the process. Future research should aim to balance improved detection accuracy against the need for real-time processing and efficient testing workflows, exploring methods to optimize this trade-off.
Second, although several DL models have demonstrated reasonable generalization across multicenter datasets, the variability in data acquisition conditions and patient characteristics remains a significant challenge. Differences in equipment, recording protocols, and patient demographics can affect model performance, emphasizing the need for a more robust foundational model that explicitly accounts for such variability. To address this, future efforts should focus on training with more heterogeneous datasets that reflect the diverse clinical scenarios encountered in practice. This approach would enhance the models' adaptability and reliability across a broader range of clinical settings.
Conclusions
This study represents the first multicenter validation of DL methods for automated ABR detection. By utilizing large and diverse multicenter datasets, we have demonstrated the effectiveness of various DL models, particularly Transformer-based architectures, in handling the complexities and variability of real-world clinical data. The results highlight the importance of training on large, heterogeneous datasets covering diverse ages, hearing statuses, and acquisition conditions to improve model generalization. The proposed DL models show promise for streamlining ABR detection, improving testing efficiency, and reducing interobserver variability, which may aid ABR interpretation for estimating hearing thresholds in a clinical context. Future work should focus on integrating sequential waveform relationships and further addressing dataset variability to optimize these models for broader clinical applications.
Footnotes
Acknowledgments
The authors thank the Department of Otorhinolaryngology & Hearing and Speech Rehabilitation, the First Affiliated Hospital of Chongqing Medical University for providing the Clinical Dataset II.
Ethical Considerations
This study was approved by the Human Research Ethics Committee of Beijing Tongren Hospital, Capital Medical University (TRECKY2019-090).
Consent to Participate
Informed consent was waived.
Author Contribution(s)
Yin Liu and Xinxing Fu designed and performed experiments; Kangkang Li and Yihan Yang collected data; Yin Liu, Lingjie Xiang, Qiang Li, Tiantian Wang, and Yuting Qin analyzed data; Yin Liu, Lingjie Xiang, and Yu Zhao provided statistical analysis; Yin Liu wrote the paper; Xinxing Fu and Chenqiang Gao provided critical revision. All authors approved the submitted version.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Key Research and Development Program of China, the National Natural Science Foundation of China, the Natural Science Foundation of Chongqing, China, and the Science and Technology Research Program of Chongqing Municipal Education Commission (grant numbers 2022YFA1004100, 62301096, CSTB2023NSCQ-MSX0659, cstc2021jcyj-bshX0206, and KJQN202400632).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statements
Clinical Datasets I and II are not publicly available due to privacy concerns regarding the subjects. Southampton Dataset is available at http://doi.org/10.5258/SOTON/D0168. PhysioNet Dataset is available at https://physionet.org/content/earndb/1.0.0/. Mendeley Dataset is available at https://data.mendeley.com/datasets/4yb9772dff/1.
Supplemental material
Supplemental material for this article is available online.
References
