Abstract
Speaker-conditioned target speaker extraction algorithms aim at extracting the target speaker from a mixture of multiple speakers by using additional information about the target speaker. Previous studies have evaluated the performance of these algorithms using either instrumental measures or subjective assessments with normal-hearing or hearing-impaired listeners. Notably, a previous study employing a quasicausal algorithm reported significant intelligibility improvements for both normal-hearing and hearing-impaired listeners, while another study demonstrated that a fully causal algorithm could enhance speech intelligibility and reduce listening effort for normal-hearing listeners. Building on these findings, this study focuses on an in-depth subjective assessment of two fully causal deep neural network-based speaker-conditioned target speaker extraction algorithms with hearing-impaired listeners, both without hearing loss compensation (unaided) and with linear hearing loss compensation (aided). Three different subjective performance measurement methods were used to cover a broad range of listening conditions, namely paired comparison, speech recognition thresholds, and categorically scaled perceived listening effort. The subjective evaluation results with 15 hearing-impaired listeners showed that one algorithm significantly reduced listening effort and improved intelligibility compared to unprocessed stimuli and the other algorithm. The data also suggest that hearing-impaired listeners experience a greater benefit than normal-hearing listeners in terms of listening effort (for both male and female interfering speakers) and speech recognition thresholds (especially in the presence of female interfering speakers), and that hearing loss compensation (linear amplification) is not required to obtain an algorithm benefit.
Keywords
Introduction
The cocktail-party problem (Bronkhorst, 2015; Cherry, 1953) exemplifies a complex acoustic scenario in which individuals attempt to follow the conversation of a target speaker in the presence of multiple interfering speakers and background noise. Understanding the target speaker in such a multitalker scenario requires significantly more cognitive effort compared to a quiet setting. Even normal-hearing (NH) listeners often struggle to fully understand the target speaker under these conditions (Brungart et al., 2009; Kidd Jr et al., 2016). This becomes even more challenging for hearing-impaired (HI) listeners (Bacon et al., 1998; Reinten et al., 2021) due to peripheral hearing deficits that impair selective attention (Shinn-Cunningham & Best, 2008). Although the mechanisms underlying selective attention are not yet fully understood, one of the primary goals of speech processing research is to develop algorithms that can mimic these abilities and effectively extract the target speaker from a mixture. Such algorithms have significant potential for various real-world applications, including hearing aids and other assistive listening devices, such as smart earbuds and hearables that enhance conversational clarity in everyday environments, or remote microphones that transmit a target speaker in a classroom scenario.
With recent advancements in deep learning (Miikkulainen et al., 2024; Schmidhuber, 2015), this study focuses on deep neural network-based speaker extraction algorithms, commonly referred to as speaker-conditioned target speaker extraction (SC-TSE). In general, SC-TSE algorithms aim to directly extract the target speaker from the mixture utilizing auxiliary information about the target speaker (Žmolíková et al., 2023) (see detailed overview in the next section). Several SC-TSE algorithms have shown impressive performance when evaluated in terms of speech quality and intelligibility using commonly used objective measures, such as scale-invariant signal-to-distortion ratio (SI-SDR) (Le Roux et al., 2019), perceptual evaluation of speech quality (PESQ) (ITU-T, 2001), short-time objective intelligibility (STOI) (Taal et al., 2011), and word error rate (Wang & Chelba, 2003). However, despite their potential, there is a lack of studies investigating the benefits of the SC-TSE algorithms for HI listeners. This study aims to address this gap.
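Of the objective measures listed above, SI-SDR is the one used as the training objective later in this paper. As a reference point, a minimal sketch of its definition (Le Roux et al., 2019) is given below; the function name and test signals are illustrative, not part of any evaluation toolkit used in the study.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio (Le Roux et al., 2019) in dB.

    The target is rescaled by the optimal projection factor alpha, making the
    metric invariant to the overall level of the estimate.
    """
    estimate = np.asarray(estimate, dtype=float)
    target = np.asarray(target, dtype=float)
    # Optimal scaling of the target onto the estimate
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target    # target component of the estimate
    noise = estimate - projection  # everything else (distortion + interference)
    return 10 * np.log10((np.sum(projection**2) + eps) / (np.sum(noise**2) + eps))

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)
print(si_sdr(0.5 * s, s))  # scale-invariant: rescaling alone costs nothing
print(si_sdr(s + 0.1 * rng.standard_normal(16000), s))  # roughly 20 dB
```

Because of the projection step, a merely rescaled but otherwise perfect estimate receives a very high score, whereas additive distortion lowers it in proportion to its energy.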
Recently, the performance of SC-TSE algorithms was subjectively evaluated by Sinha et al. (2023) and Thoidis and Goehring (2024). Thoidis and Goehring (2024) conducted double-blind sentence recognition tests with both NH and HI listeners for a quasicausal SC-TSE algorithm, using mixtures of one, two, and three speakers in restaurant noise, and compared its performance against speech enhancement algorithms (which do not use auxiliary information about the target speaker). The results demonstrated that both NH and HI listeners benefited from the SC-TSE algorithm, with HI listeners experiencing a greater improvement compared to NH listeners. Despite these promising findings, the study’s scope had notable limitations. Specifically, the language of the target speaker differed from that of the interfering speaker(s), and the signal-to-noise ratio (SNR) between target and interfering speaker(s) was fixed at
In this study, we systematically investigate the benefits of SC-TSE algorithms for HI listeners using the same three evaluation methods as Sinha et al. (2023), with the same language for target and interfering speakers, and for a broader range of SNRs, especially since different evaluation methods differ with respect to the applicable range of SNRs. Speech recognition tests for NH listeners are typically conducted at negative SNRs because ceiling performance is already reached below 0 dB SNR. However, in such challenging conditions, SC-TSE algorithms often struggle, as separating the target speaker from the interfering speaker(s) becomes more difficult. At higher SNRs, where these algorithms tend to perform better, other measures such as listening effort are better suited to evaluate speech perception. For HI listeners, speech-on-speech masking may result in SRTs of

Research Question 1: Do SC-TSE algorithms provide comparable or greater benefits for HI listeners compared to NH listeners, also at low SNRs (SNRs

Research Question 2: Does hearing loss compensation enhance the benefits of SC-TSE processing for HI listeners, or do listeners without hearing loss compensation experience similar benefits?
To answer these questions, we evaluated the potential of two SC-TSE algorithms to enhance the speech perception of the target speaker across a broad range of SNRs. The evaluations covered various acoustic conditions, including scenarios with one or two interfering speakers and with or without gender differences between the target speaker and the interfering speaker(s). Measurements were conducted both with (aided) and without (unaided) hearing loss compensation to assess the impact of compensation. Furthermore, we also compared the performance of the SC-TSE algorithms with HI listeners to the previous study performed with NH listeners (Sinha et al., 2023).
Target Speaker Extraction Algorithms
Target speaker extraction is closely related to both speech enhancement and blind source separation (BSS). While speech enhancement and target speaker extraction both aim to suppress undesired sources, BSS aims to estimate all individual sources from a mixture. The main distinction lies in their objectives: BSS separates all sources, whereas target speaker extraction isolates only the speaker of interest. Compared to speech enhancement, which typically addresses nonspeech interfering sources, target speaker extraction deals with interference from overlapping speech, making it especially relevant in multitalker scenarios.
Target speaker extraction aims at extracting the speaker of interest from a mixture of multiple speakers. One approach to achieve this is by first utilizing a BSS technique (Vincent et al., 2018) to separate all individual sources from the mixture and then selecting the target speaker with a speaker selection module (Sinha et al., 2024). However, BSS techniques typically require the number of sources in the mixture to be known or estimated, which is not trivial in practice. Another approach involves using speech enhancement algorithms trained to enhance the dominant speaker in the mixture (Thoidis & Goehring, 2024). However, these algorithms often fail to generalize when the interfering speakers are at an equal or higher level than the target speaker, an issue frequently encountered in real-world scenarios.
An alternative approach is to use an SC-TSE algorithm (Žmolíková et al., 2023), which aims at directly extracting the target speaker from the mixture. SC-TSE algorithms typically require auxiliary information about the target speaker. The most commonly used types of auxiliary information include reference speech (Sinha et al., 2024; Wang et al., 2019; Xu et al., 2020; Žmolíková et al., 2019), visual cues (Ephrat et al., 2018), speech activity (Delcroix et al., 2021), or directional information (Brendel et al., 2020; Gu et al., 2019) about the target speaker.
In this study, we consider two different SC-TSE algorithms that use the reference speech of the target speaker as auxiliary information (Delcroix et al., 2020; Ge et al., 2020; Sinha et al., 2024, 2022; Wang et al., 2019; Xu et al., 2020). The reference speech is a prerecorded utterance of the target speaker which is different from the utterance of the target speaker used in the mixture. Figure 1 depicts the block diagram of a typical SC-TSE algorithm, consisting of two networks: a speaker embedder network and a speaker separator network. The goal of the speaker embedder network is to generate a speaker embedding from the reference speech of the target speaker. The target speaker embedding represents condensed speech features of the target speaker which guides the separator network toward estimating the target speaker from the mixture.

Block Diagram of Speaker-Conditioned Target Speaker Extraction (SC-TSE) Algorithm.
For this study, we used the same algorithms (Algo-1 and Algo-2) previously evaluated by Sinha et al. (2023) with NH listeners. It should be noted that the primary aim of this study is not to compare the algorithms, but to investigate whether the performance trends observed in objective evaluation metrics align with the subjective outcomes in HI listeners, as was similarly investigated for NH listeners (Sinha et al., 2023). Inspired by Wang et al. (2019), Algo-1 employs separate training of the embedder and separator networks and estimates a real-valued mask in the short-time Fourier transform (STFT) domain to perform the speaker extraction. Due to the use of a real-valued mask, Algo-1 does not estimate the phase of the target speaker in the STFT domain but uses the phase of the mixture. Algo-1 uses low-complexity ResNet-gated recurrent units (ResNet-GRU) in the separator network and a pretrained long short-term memory network in the embedder network. For Algo-1, the total computational complexity in terms of number of multiplications and additions (MACs) is
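The real-valued masking with mixture phase used by Algo-1 can be illustrated with a short sketch. Note that the mask below is an ideal-ratio-style placeholder computed from the (normally unknown) clean signals purely for illustration; in Algo-1 it would be the output of the separator network, and the white-noise "speakers" merely stand in for real speech.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(1)
target = rng.standard_normal(fs)      # stand-ins for target and interferer
interferer = rng.standard_normal(fs)
mixture = target + interferer

# STFT of the mixture; the separator network would operate on features of this
f, t, X = stft(mixture, fs=fs, nperseg=512)

# Placeholder for the network's real-valued mask: an ideal ratio mask computed
# from the clean signals, purely for illustration
_, _, S = stft(target, fs=fs, nperseg=512)
_, _, N = stft(interferer, fs=fs, nperseg=512)
mask = np.abs(S) / (np.abs(S) + np.abs(N) + 1e-8)  # real-valued, in [0, 1]

# Apply the mask to the mixture magnitude and reuse the mixture phase:
# the phase of the target speaker itself is never estimated
S_hat = mask * np.abs(X) * np.exp(1j * np.angle(X))
_, target_est = istft(S_hat, fs=fs, nperseg=512)
```

Reusing the mixture phase keeps the approach simple and causal-friendly, but it is also one source of the processing artifacts discussed later.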
Both algorithms were trained on the same dataset for mixtures of two speakers, mixtures of three speakers, and noisy mixtures of two speakers at a sampling rate of
It should be noted that both algorithms were trained and validated using mixtures composed of English speech, while the subjective evaluations were conducted using German speech materials. Algo-1 was trained using the SI-SDR loss (Luo & Mesgarani, 2019), while Algo-2 used a weighted combination of a multiscale SI-SDR loss for the speaker separator network and a cross-entropy loss for the speaker embedder network, as in Ge et al. (2020). Both algorithms were optimized with the ADAM optimizer for up to 150 epochs with early stopping.
Participants and Stimuli
Participants
Fifteen native German-speaking HI listeners (

Individual and Group Mean Hearing Thresholds (in dB) for the Right and Left Ears for the Participants.
Stimuli and Equipment
The same stimuli as in Sinha et al. (2023) were used in this study. The target speaker stimuli consisted of German matrix sentences uttered by a fixed male speaker from the Oldenburg sentence test (OLSA) (Wagener et al., 1999). The reference speech of the target speaker was chosen from the German Göttingen sentence test (GÖSA) (Kollmeier & Wesselkamp, 1997), which consists of everyday sentences uttered by the same male speaker. Each OLSA sentence followed a fixed syntactical structure containing five words in the following order: name, verb, numeral, adjective, and object. For each word,
To familiarize the participants with the voice of the target speaker, each participant listened to an example of about 60
Subjective Evaluation Methods and Procedures
To assess the performance of both considered SC-TSE algorithms, paired comparisons (Parizet, 2002), speech intelligibility measurements (Wagener et al., 1999), and perceived listening effort scaling (Rennies et al., 2014) were utilized. These methods vary in terms of the outcome measure and the SNR range to which they are applicable. Paired comparisons were used to determine the preferences of participants between different versions of the same stimulus. An SNR =
In this study, we used either one or two interfering speakers in the unprocessed stimuli, depending on the specific evaluation method applied. Paired comparisons and perceived listening effort were measured for stimuli in which the target speaker was masked by either one or two interfering speakers, while SRTs were measured only for stimuli having two interfering speakers. We excluded SRT measurements with only one interfering speaker, as SRTs for such conditions are known to be extremely low (Rhebergen et al., 2005), that is, falling into SNR regions where algorithms are not expected to work well, nor where typical listening conditions would occur (Smeds et al., 2015). For all methods, the signals were initially scaled such that the target speaker had the same level (65 dB SPL before hearing loss compensation) as the single or the combined interfering speaker(s), and then the target speaker was adapted to generate stimuli at different SNRs. In the processed conditions, these mixtures were processed by either Algo-1 or Algo-2, typically reducing the energy of the interfering speaker(s). In the aided conditions, hearing loss compensation was applied afterwards.
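The level-matching and SNR-setting procedure just described can be sketched as follows; digital RMS stands in for the 65 dB SPL calibration, and the function name is illustrative.

```python
import numpy as np

def mix_at_snr(target, interferers, snr_db):
    """Level-match the summed interferers to the target, then scale the target.

    Mirrors the procedure described above: the combined interferer signal is
    first matched to the target level (0 dB SNR), after which the target is
    adapted to reach the requested SNR. Digital RMS stands in for the 65 dB
    SPL calibration used in the actual experiment.
    """
    rms = lambda x: np.sqrt(np.mean(np.square(x)))
    masker = np.sum(interferers, axis=0)
    masker = masker * (rms(target) / rms(masker))  # level-match: 0 dB SNR
    target_scaled = target * 10 ** (snr_db / 20)   # adapt the target level
    return target_scaled + masker, target_scaled, masker
```

For example, `mix_at_snr(t, [i1, i2], -5.0)` yields a mixture in which the target is 5 dB below the combined interferers.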
Each evaluation was conducted for three processing conditions (unprocessed, Algo-1, and Algo-2), two genders of interfering speakers (male and female), and different numbers of interfering speakers (depending on the evaluation method). In the paired comparison, all combinations of three pairwise processing conditions, one and two interfering speakers, and two genders were considered, resulting in 12 unique conditions. Each condition was repeated three times, leading to 36 trials per participant for both unaided and aided conditions. For the SRT measurements, only two interfering speakers were considered, combined with three processing conditions and two genders, resulting in six unique conditions and approximately
All measurements were performed for both unaided and aided conditions. For the unaided condition, no hearing loss compensation was provided, while for the aided condition, individualized amplification was provided to the participants. The stimuli were amplified by applying a linear gain according to the National Acoustic Laboratories Revised Profound (NAL-RP) prescription (Dillon, 2012). For each participant, the amplification applied to the left and right ears was identical and calculated based on the average hearing threshold across both ears.
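As a rough sketch of this kind of linear, audiogram-based amplification, the snippet below averages hypothetical thresholds across both ears and realizes a frequency-dependent linear gain with an FIR filter. The placeholder rule "gain = 0.5 × average threshold" only stands in for the prescription; it is not the actual NAL-RP formula, and the thresholds are invented for illustration.

```python
import numpy as np
from scipy.signal import firwin2, lfilter

fs = 16000

# Audiogram frequencies and hypothetical thresholds in dB HL for each ear
freqs_hz = np.array([250, 500, 1000, 2000, 4000, 6000])
thr_left = np.array([20, 25, 35, 45, 55, 60])
thr_right = np.array([25, 30, 40, 50, 60, 65])

avg_thr = (thr_left + thr_right) / 2  # identical gain for both ears
gains_db = 0.5 * avg_thr              # placeholder rule, NOT the NAL-RP formula

# Design a linear-phase FIR filter realizing the frequency-dependent gain
norm_f = np.concatenate(([0.0], freqs_hz / (fs / 2), [1.0]))
lin_gain = 10 ** (gains_db / 20)
gain_curve = np.concatenate(([lin_gain[0]], lin_gain, [lin_gain[-1]]))
h = firwin2(513, norm_f, gain_curve)

# Apply the (linear) amplification to a stimulus
rng = np.random.default_rng(2)
stimulus = rng.standard_normal(fs)
aided = lfilter(h, 1.0, stimulus)
```

Because the gain is linear and time-invariant, it can be applied after algorithm processing without interacting with the extraction itself, matching the order of operations described above.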
Results
Paired Comparisons
Figure 3 shows the percentage of wins from the paired comparison tests for both unaided (left column) and aided (right column) conditions. The top and middle panels compare unprocessed stimuli with stimuli processed by Algo-1 and Algo-2, while the bottom panels compare Algo-1 directly with Algo-2. Different hatches/colors represent different masking conditions (M/F: one male/female interfering speaker and MM/FF: two male/female interfering speakers). For both unaided and aided conditions, the data reveal a relatively similar pattern of ratings, where a clear preference for Algo-2 was observed compared to unprocessed stimuli and Algo-1. Stimuli processed by Algo-2 were favored in comparison to unprocessed stimuli in

Percentage of Wins From the Paired Comparison Tests Obtained for Each Pair of the Three Processing Conditions (Unprocessed, Algo-1, and Algo-2) for Stimuli Having One (
Speech Recognition Thresholds
Figure 4 shows the measured averaged SRTs (top panels) and the corresponding improvements achieved by each algorithm compared to the unprocessed stimuli (bottom panels). For both unaided (left column) and aided (right column) conditions, mean SRTs were considerably lower for female interfering speakers than for male interfering speakers. Mean SRTs obtained for unprocessed stimuli were

Speech Recognition Thresholds (SRTs) Averaged Across All Participants (Top) and Corresponding SRT Improvements (Bottom) Obtained for Stimuli Having Two Female (
It should be noted that, for the unaided condition, the SRT measurements of three participants were invalid for Algo-1 with male interfering speakers. This occurred because participants mistakenly followed one of the interfering speakers instead of the target speaker. As a result, the adaptive procedure kept increasing the SNR until the predefined maximum of
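How such invalid tracks arise can be illustrated with a deliberately simplified adaptive procedure. The sketch below uses a basic 1-up/1-down rule with a fixed step size; the actual OLSA procedure adapts the SNR based on the proportion of correctly repeated words and uses variable step sizes, so all parameters here are illustrative.

```python
def adaptive_srt_track(respond, start_snr=0.0, step=2.0, max_snr=20.0, n_trials=20):
    """Simplified adaptive track: lower the SNR after a correct response,
    raise it after an incorrect one, capped at a predefined maximum.

    If the listener mistakenly follows an interfering speaker, responses stay
    incorrect and the track saturates at the ceiling -- the 'invalid track'
    case described above.
    """
    snr = start_snr
    history = []
    for _ in range(n_trials):
        correct = respond(snr)  # True if the trial was answered correctly
        history.append(snr)
        snr = snr - step if correct else min(snr + step, max_snr)
    # Crude SRT estimate: mean SNR over the second half of the track
    srt = sum(history[n_trials // 2:]) / (n_trials - n_trials // 2)
    return srt, history

# A listener tracking the target: correct whenever the SNR exceeds a threshold
srt, _ = adaptive_srt_track(lambda snr: snr > -6.0)

# A listener following an interfering speaker: never correct, hits the ceiling
bad_srt, bad_hist = adaptive_srt_track(lambda snr: False)
assert bad_hist[-1] == 20.0  # track saturates at the predefined maximum
```

The first track converges near the simulated threshold, whereas the second climbs monotonically until it is pinned at the maximum SNR, which is why such runs must be discarded as invalid.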
Statistical analyses were performed using a linear mixed-effects model with the lme4 package in R software (Bates et al., 2015), which is well suited for handling missing data (invalid data from the three participants were treated as missing data). Participants were treated as a random factor. We conducted a comprehensive diagnostic evaluation, including visual inspection of posterior predictions, linearity, homogeneity of variance, collinearity, influential observations, normality of residuals, and normality of random effects using the performance package in R (Lüdecke et al., 2021). Furthermore, we performed contrast analysis with Holm corrections using the modelbased package (Makowski et al., 2020), with an alpha level of .05 for all tests. Visual inspection of the residuals of the linear mixed-effects model predicting the outcome variable SRT revealed a normal distribution.
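The analysis itself was run with lme4 in R; for readers working in Python, a rough analogue of the model structure (random participant intercepts, all two- and three-way fixed-effect interactions) can be sketched with statsmodels. The data below are simulated placeholders with invented effect sizes, not the study's measurements.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated placeholder data standing in for the measured SRTs
rng = np.random.default_rng(3)
rows = []
for pid in range(15):
    subj_offset = rng.normal(0, 1.5)  # random participant effect
    for proc in ["unprocessed", "algo1", "algo2"]:
        for comp in ["unaided", "aided"]:
            for gender in ["female", "male"]:
                effect = {"unprocessed": 0, "algo1": -0.5, "algo2": -4}[proc]
                base = -14 if gender == "female" else -6  # invented values
                rows.append(dict(participant=pid, processing=proc,
                                 compensation=comp, gender=gender,
                                 srt=base + effect + subj_offset + rng.normal(0, 1)))
df = pd.DataFrame(rows)

# Mixed model with all interactions and participants as random intercepts
# (cf. lme4's  srt ~ processing * compensation * gender + (1 | participant))
model = smf.mixedlm("srt ~ processing * compensation * gender",
                    df, groups=df["participant"]).fit()
print(model.summary())
```

Note that statsmodels' `mixedlm` covers random intercepts and slopes but not the full diagnostic and contrast tooling of the R performance and modelbased packages, which is presumably why the original analysis was done in R.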
A linear mixed-effects model was fitted to the measured SRTs to estimate the fixed effects of processing (unprocessed, Algo-1, and Algo-2), hearing loss compensation (unaided and aided), and the gender of interfering speakers (female and male), along with their two- and three-way interactions. An analysis of variance revealed significant main effects of processing,
Significant effects were further analyzed using contrast analysis to compare the three factors (see Table 1). Table 1 is divided into two parts. The first part presents the main effects of processing, hearing loss compensation, and the gender of interfering speakers. The second part presents the pairwise differences between all levels of processing for each gender of interfering speakers and the pairwise differences between all levels of gender for each processing. Only statistically significant differences are reported. The results revealed significant differences between unprocessed and Algo-2, as well as between Algo-1 and Algo-2, for both male and female interfering speakers. However, no significant difference was found between unprocessed and Algo-1. Additionally, for each processing condition (unprocessed, Algo-1, and Algo-2), a significant difference was observed between male and female interfering speakers.
Results of Contrast Analysis for Predicting the Differences in SRTs. Only Significant Differences Are Reported.
Perceived Listening Effort
Figure 5 shows the median listening effort ratings across participants along with the corresponding benefits for one and two interfering speakers for both unaided and aided conditions as a function of SNR. The first three rows represent the listening effort ratings for unprocessed stimuli, Algo-1, and Algo-2, while the last row represents the listening effort benefits of both algorithms compared to unprocessed stimuli. In general, listening effort ratings systematically decreased with increasing SNR for both one and two interfering speakers (except for Algo-1 at high SNRs), and followed a similar pattern for the unaided and aided conditions. For unprocessed stimuli and low SNRs, the perceived effort was higher with two interfering speakers compared to one interfering speaker. For two interfering speakers, participants also rated the

Median Perceived Listening Effort Ratings and Benefit Relative to Unprocessed Stimuli as a Function of SNR for Stimuli Having One (
To investigate the effect of SNR and hearing loss compensation on listening effort ratings, we conducted a statistical analysis of the listening effort benefit provided by Algo-1 and Algo-2 compared to the unprocessed stimuli. A linear mixed-effects model was fitted to the listening effort benefit to estimate the fixed effects of processing benefits (Algo-1 vs. unprocessed, and Algo-2 vs. unprocessed), hearing loss compensation (unaided and aided), and SNRs, along with their two-way interactions. An analysis of variance revealed significant main effects for processing benefits,
Results of Contrast Analysis for Predicting the Differences in Listening Effort Benefits. Only Significant Differences Are Reported.
Table 2 is divided into two parts. Part 1 presents the main effect of processing benefit, while Part 2 presents the pairwise comparison of processing benefit for each SNR. Only statistically significant differences are reported. The main effects of SNR and the pairwise comparisons between all SNRs for each processing benefit type were also analyzed. The results revealed that, for each SNR, there was a significant difference in the benefits provided by Algo-1 and Algo-2. Additionally, significant differences were found between most SNR pairs, except for the following cases: (
Discussion
Differences for Unprocessed Stimuli Between NH and HI
The present data revealed considerable differences in terms of SRTs and listening effort ratings between NH (see Figures
Algorithm Benefits for NH Versus HI
Similarly to NH listeners (Sinha et al., 2023), Algo-1 did not provide any improvement over unprocessed stimuli for HI listeners in terms of preference, speech intelligibility, or listening effort, while Algo-2 consistently demonstrated improvements over unprocessed stimuli for all considered evaluation methods. Notably, Algo-2 provided greater improvements for HI listeners compared to NH listeners for all evaluation methods. For NH listeners, Algo-2 improved mean SRTs (
Algo-2 also showed a reduction in listening effort over a broad range of SNRs for both NH listeners (Sinha et al., 2023) and HI listeners for both one and two interfering speakers. However, the benefits were more pronounced for HI listeners with reductions of 7-8 ESCU compared to 4-5 ESCU for NH listeners at medium SNRs. Even at the lowest SNR (
The paired comparison results showed a similar trend for both NH listeners (Sinha et al., 2023) and HI listeners. However, HI listeners displayed a stronger preference for Algo-2, with the majority of their ratings falling into the “much easier” category of the rating scale. Across all evaluation methods, Algo-2 provided greater benefits for HI listeners compared to NH listeners. This overall result aligns with findings from Thoidis and Goehring (2024), who observed similar outcomes in a double-blind sentence recognition test with a different SC-TSE algorithm.
We also performed Pearson correlation analyses to assess whether hearing loss severity, as measured by PTA4 (averaged across both ears), is related to the SRTs and listening effort benefits of Algo-2, analyzed separately for male and female interfering speakers. No significant correlations were observed between hearing loss severity and algorithm benefit across any condition, unaided or aided, with one or two interfering speakers, or at any SNR.
Impact of Algorithmic Artifacts
In general, SC-TSE algorithms may introduce artifacts in the processed signals, mainly depending on the SNR of the mixture, the number of interfering speakers, and the speaker separator network used. Typical artifacts include distortions of the extracted target speaker and residual interference from the interfering speaker(s). As already mentioned, Algo-1 performs target speaker extraction in the STFT domain using a real-valued mask (hence using the mixture phase) with a relatively simple network architecture, whereas Algo-2 performs target speaker extraction in the time domain with a more complex network architecture. Several studies (Ge et al., 2020; Pandey & Wang, 2018; Xu et al., 2019) have shown that real-valued spectral masking-based approaches typically introduce more artifacts compared to time-domain approaches.
A significant impact of artifacts introduced by Algo-1 was also observed in HI listeners. Algo-1 not only failed to reduce listening effort compared to unprocessed stimuli but, in some cases, even increased the required effort. Specifically, the listening effort was higher at
Effects of Hearing Loss Compensation
Apart from the appearance of such invalid tracks during the SRT measurements, this study found no significant differences between measurements without (unaided) and with (aided) hearing loss compensation. Results from all three evaluation methods showed very similar patterns for both unaided and aided conditions. Even though SRTs for the aided condition (unprocessed, Algo-1, and Algo-2) were slightly lower than for the unaided condition, the difference was small (
One limitation of this study is that all participants had a relatively symmetric hearing loss; it is possible that the role of hearing loss compensation could be different for participants with asymmetric hearing loss, where the two ears would require different amplification. Additionally, all participants had mild to moderate hearing loss. Therefore, the effects of hearing loss compensation observed here may not generalize to individuals with more severe hearing loss, who may respond differently to SC-TSE algorithms. Future work will include a broader range of hearing loss severities to better understand how hearing loss compensation influences algorithmic performance across diverse listener profiles.
Limitations of Algorithms and Future Directions
The results of this study suggest that HI listeners could benefit significantly from SC-TSE algorithms if these are implemented in practical applications such as hearing aids. However, for an algorithm to be suitable for hearing aids, it needs to meet specific requirements. The algorithm needs to be capable of real-time processing, that is, have low computational complexity and an algorithmic latency of
Conclusions
The following main conclusions can be drawn from this study:
Similar to findings with NH listeners, Algo-2 demonstrated considerable benefits for all considered evaluation methods with HI listeners, while Algo-1 did not show any benefit compared to unprocessed stimuli.

HI listeners experienced a greater reduction in listening effort at lower SNRs and a greater improvement in SRTs compared to NH listeners, particularly in the presence of female interfering speakers, while both groups exhibited similar SRT gains when the interfering speakers were male. This suggests that SC-TSE algorithms can be effective in enhancing target speech perception and reducing perceived listening effort for HI listeners, potentially providing even greater benefits than for NH listeners.

Artifacts introduced by SC-TSE algorithms, especially Algo-1, were observed to impact HI listeners similarly as NH listeners.

Although susceptibility to processing artifacts may be increased in unaided conditions for some listeners, none of the evaluation methods showed a significant impact of hearing loss compensation. Results for both unaided and aided conditions were relatively similar for all considered evaluation methods, indicating that the benefit of SC-TSE algorithms does not depend on hearing loss compensation using linear amplification.
Footnotes
Acknowledgments
The authors thank Jonathan Albert Gößwein for his valuable suggestions and insightful discussions regarding the statistical analysis of data.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The Oldenburg Branch for Hearing, Speech and Audio Technology HSA is funded in the program
Declaration of Conflicting Interest
The authors declared no potential conflicts of interest for the research, authorship, and/or publication of this article.
Data Availability Statement
All data supporting this study are not publicly available due to privacy and ethical reasons. However, they can be made available from the corresponding author upon reasonable request. The evaluation stimuli were the same as those used in Sinha et al. (2023).
