Abstract
Dynamic range compression is a compensation strategy commonly used in modern hearing aids. Fast-acting systems respond relatively quickly to the fluctuations in the input level. This allows for more effective compression of the dynamic range of speech and hence enhances the audibility of its low-intensity components. However, such processing also amplifies the background noise, distorts the modulation spectra of both the speech and the background, and can reduce the output signal-to-noise ratio (SNR). Recently, May et al. proposed a novel SNR-aware compression strategy, in which the compression speed is adapted depending on whether speech is present or absent. Fast-acting compression is applied to speech-dominated time–frequency (T-F) units, while noise-dominated T-F units are processed using slow-acting compression. It has been shown that this strategy provides a similar effective compression of the speech dynamic range as conventional fast-acting compression, while introducing fewer distortions of the modulation spectrum of the background and providing an improved output SNR. In this study, this SNR-aware compression strategy was compared with conventional fast- and slow-acting compression in terms of speech intelligibility and subjective preference in a group of 17 hearing-impaired listeners with varying degrees of hearing loss. The results show a speech intelligibility benefit of the SNR-aware compression strategy over the conventional slow-acting system. Furthermore, the SNR-aware approach demonstrates an increased subjective preference compared with both conventional fast- and slow-acting systems.
Introduction
Sensorineural hearing loss is associated with a decreased sensitivity to low-intensity sounds as well as a range of suprathreshold auditory deficits. These deficits include, among others, the phenomenon of loudness recruitment and the limitation of the dynamic range (e.g., Bacon & Oxenham, 2004; Smeds & Leijon, 2011). To account for this, modern hearing aids typically implement some form of level-dependent amplification such as wide dynamic range compression (WDRC, see Souza, 2002, for a review). Such systems provide relatively high gain for low-intensity input sounds to ensure sufficient audibility, which appears to be necessary for good speech recognition (Pavlovic & Studebaker, 1984; Souza & Turner, 1999; Woods et al., 2013). As the input level increases, the gain is reduced to avoid loudness discomfort. To follow the temporal dynamics of speech, a compression system should respond rapidly to changes in the input level across time (Edwards, 2004; Moore, 2008; Souza, 2002). This requires the use of short time constants in the level estimation stage of the signal-processing chain (for implementation details, see Giannoulis et al., 2012; Kates, 1993). However, the application of short time constants can also lead to rapid fluctuations in the gain function over time, introducing potentially detrimental distortions of the temporal envelope of speech (e.g., Gatehouse et al., 2006; Jenstad & Souza, 2005, 2007; Plomp, 1988; Souza et al., 2012a; Walaszek, 2008). A number of studies have shown that fast-acting WDRC provides an improvement in audibility of speech sufficient to offset the potentially detrimental distortion of the temporal envelope of the signal, leading to a net intelligibility benefit. This was demonstrated for speech in quiet by Villchur (1973), Souza and Turner (1998, 1999), Souza and Bishop (1999), and Davies-Venn et al. (2009). 
An acoustic analysis conducted by Alexander and Rallapalli (2017) showed that fast-acting compression leads to a higher effective compression ratio (ECR, based on short-term level histograms) compared with slow-acting compression. This has a positive effect on speech audibility but, on the other hand, negatively affects the speech modulation transfer function (MTF). The speech-recognition results reported in the same study suggest that, in many cases, the audibility benefit counteracts the negative effects of envelope-domain distortion.
When the target speech is degraded by background noise, the benefit of WDRC appears to depend on a variety of factors such as the spectrotemporal characteristics of the noise, the overall input level, and the signal-to-noise ratio (SNR), as demonstrated, for example, by Yund and Buckles (1995). Souza et al. (2006) demonstrated that the presence of background noise decreases the overall amount of envelope fluctuations, leading to less dynamic changes in the gain function and, as a result, a decreased ECR of speech. Rhebergen et al. (2009) reported beneficial effects of compression on the speech reception threshold (SRT) when the processing was applied to the speech alone prior to mixing it with the background noise. However, such conditions are rather artificial. Rhebergen et al. also considered a more realistic scenario, in which the processing was instead applied to the mixture of speech and either a stationary or a nonstationary, interrupted noise. In that case, compression had a pronounced beneficial effect on the SRT in the interrupted noise. Similar findings were reported in a later study by Rhebergen et al. (2017). At negative SNRs (as was the case in both studies of Rhebergen et al.), the interferer is the more dominant stimulus and its temporal fluctuations drive the compression system. The gain is increased during the dips in the noise, amplifying the low-level glimpses of speech present in those dips. The results of Desloge et al. (2017) and Kowalewski et al. (2018) further support the notion that fast-acting compression systems provide improved short-term audibility and increased opportunities for glimpsing, as long as the noise exhibits prominent fluctuations and the long-term input SNR is negative.
In contrast, in scenarios characterized by high long-term input SNRs, the compression is driven mostly by the changes of the speech level. The fast changes in gain cyclically amplify the background, introducing modulation components to the noise (Stone & Moore, 2003, 2004, 2007, 2008) and reducing the long-term output SNR (Hagerman & Olofsson, 2004; May et al., 2018; Naylor & Johannesson, 2009; Rhebergen et al., 2009, 2017; Souza et al., 2006). Both effects are potentially detrimental to speech intelligibility and the perceived sound quality. Taken together, the previous findings indicate that fast-acting compression has largely positive effects on speech intelligibility due to increased audibility and a reduced dynamic range in the following scenarios: (a) speech in quiet, (b) speech in the presence of a strongly fluctuating noise at a negative SNR, and (c) speech compressed prior to mixing it with noise (unrealistic). These benefits are largely reduced, or turn into a detriment, as soon as the input SNR becomes positive (which is a common scenario, see Smeds et al., 2015; Weisser & Buchholz, 2019) and/or when the interferer is stationary. It is nevertheless possible that the advantages of fast-acting compression would be restored if a selective processing of the speech and the noise components could be achieved.
Several studies have focused on the effects of compression release time on listeners’ subjective preference and/or perceived quality. Their conclusions are largely in line with the aforementioned studies on speech intelligibility. Neuman et al. (1995) investigated hearing-impaired (HI) listeners’ overall preference for the compression release time (60, 200, and 1000 ms) when processing speech in the presence of background noise of varying characteristics and levels. Overall, longer release times were preferred for the types of noise naturally characterized by higher sound pressure levels (SPLs). In a follow-up study using the same set of conditions (Neuman et al., 1998), the listeners rated several attributes of sound quality. The results indicated that, with longer compression release times, the ratings of the overall impression, pleasantness, and clarity increased, while the rating of noisiness decreased. This was likely due to the above-mentioned cyclical amplification of the background noise that occurs at positive input SNRs. The effect becomes more prominent with shorter release times (as more gain is provided to the noise during the speech gaps) and is more noticeable as the level of the background increases. A similar preference for longer release times was demonstrated by Hansen (2002) in a group of HI listeners and a range of acoustic scenarios. Neuman et al. (1995) suggested the use of an adaptive release time in hearing aids in order to improve the perceived sound quality. A shorter release time could be used in quieter scenarios, while a longer release time could be applied with increasing levels of background noise. Several adaptive compression strategies have been proposed in the past, including the K-AMP (Killion et al., 1992), the dual front-end adaptive gain control (Moore & Glasberg, 1988), the guided level estimator (Neumann, 2008), and the short-term dynamic-range-driven system proposed by Lai et al. (2013).
However, all of these systems rely on short-term level dynamics of the speech and noise mixture and do not explicitly utilize information related to the presence of the target signal with respect to the background noise.
The SNR-aware dynamic range compression strategy presented by May et al. (2018) attempts to combine the advantages of both fast- and slow-acting compression. The main idea is to adjust the release time of the compressor in each individual time–frequency (T-F) unit depending on whether the target is present or absent. Specifically, a short release time is applied to speech-dominated T-F units where the short-term SNR is high, while a longer release time is used to process T-F units that are dominated by noise. The SNR-aware compression strategy bears some similarities with the aforementioned artificially created scenario tested by Rhebergen et al. (2009), where the speech alone was compressed prior to mixing it with noise. The difference is that the SNR-aware approach operates on the noisy speech mixture and does not require the availability of separate speech and noise signals, making it potentially applicable in hearing devices. Similar principles had previously been applied in the compression system driven by the direct-to-reverberant energy ratio, which was shown to preserve the listeners’ spatial perception (Hassager et al., 2017). May et al. (2018) provided an instrumental evaluation of the SNR-aware compression strategy and compared it with conventional fast- and slow-acting compression. The SNR-aware compression strategy provided ECRs similar to those obtained with conventional fast-acting compression, while the natural fluctuations in the background noise were preserved in a similar way as when slow-acting compression was applied.
In this study, the SNR-aware dynamic range compression strategy was evaluated in terms of speech intelligibility and subjective preference in a group of HI listeners. It was hypothesized that the SNR-aware compression strategy would provide superior audibility compared with slow-acting compression, while it would result in a higher output SNR and introduce fewer distortions of the background compared with fast-acting compression, leading to superior speech intelligibility performance and higher preference scores. To exclude the potential effects of SNR estimation errors on perception, the ideal SNR-aware strategy based on the a priori SNR was tested.
Methods
Participants
The study included 17 HI listeners aged 25 to 80 years (average 68.7 years). All participants underwent screening conducted by a trained audiologist, which included tympanometry, pure-tone audiometry (air and bone conduction), and word recognition scores in quiet (discrimination scores) using the Dantale corpus (Elberling et al., 1989). Based on this evaluation, all listeners’ hearing loss was classified as sensorineural. The listeners’ audiograms were compared with the standard audiograms proposed by Bisgaard et al. (2010) and were further classified into three groups based on the smallest absolute distance criterion (in dB): seven listeners in the N2 group, seven listeners in the N3 group, and three listeners in the N4 group. The tested ear was chosen based on the best match to the desired hearing profile and/or the best discrimination score. To ensure that an SRT could be reliably measured in noise, a discrimination score exceeding 80% was required as an inclusion criterion. The listeners’ audiograms are shown in Figure 1. All listeners were native speakers of the Danish language. After a short introduction to the test procedure, they provided informed consent. All listening tests were conducted at the Technical University of Denmark. The experiments were approved by the Science-Ethics Committee for the Capital Region of Denmark (Reference H-16036391).

Pure-Tone Audiograms of Listeners in the N2, N3, and N4 Groups. Individual audiograms of all listeners in a given group are shown with a gray line, while the corresponding standard audiogram is indicated by the thick black line. The audiograms are shown for frequencies up to 6 kHz, which is the uppermost frequency in the profiles provided by Bisgaard et al. (2010).
Signal Processing and Fitting
All dynamic range compression systems were based on the short-time discrete Fourier transform using frames of 10 ms duration with 75% overlap and operated in seven independent octave-wide frequency channels with center frequencies ranging from 125 to 8000 Hz. The level estimation in each frequency channel was performed using a first-order infinite impulse response filter with different time constants associated with the attack and the release (Kates, 1993). As shown in Table 1, the following three compression systems were tested: conventional fast- and slow-acting compression as well as SNR-aware compression. The attack time in the level estimator was always set to 5 ms. The fast-acting system utilized a level estimator with a short release time of 40 ms, while it was set to 2000 ms for the slow-acting system. The level estimator in the SNR-aware system switched between the short and the long release time in individual T-F units by applying a threshold criterion of 0 dB to the a priori SNR. If the a priori SNR was higher than 0 dB, the corresponding T-F unit was processed with the short release time, resulting in a fast-acting system. Otherwise, if the a priori SNR was lower than 0 dB, the long release time was used, resulting in a slow-acting system. The a priori SNR was calculated by comparing the energy of the separate speech and noise signals in individual T-F units. More details concerning the implementation of the algorithms can be found in May et al. (2018).
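The level estimation and release-time switching described above can be illustrated with a minimal sketch for a single frequency channel. It is not the implementation of May et al. (2018). The function names, the smoothing of levels in the dB domain, and the mapping from a time constant to a filter coefficient (1/e decay) are assumptions made for illustration. The frame rate of 400 Hz follows from 10-ms frames with 75% overlap (one frame every 2.5 ms).

```python
import math

FRAME_RATE_HZ = 400.0  # 10-ms frames with 75% overlap -> one frame every 2.5 ms


def smoothing_coeff(time_constant_ms, frame_rate_hz=FRAME_RATE_HZ):
    """One-pole coefficient; assumes the time constant is the 1/e decay time."""
    return math.exp(-1.0 / (time_constant_ms * 1e-3 * frame_rate_hz))


def estimate_level(level_db, snr_db=None, mode="snr_aware",
                   attack_ms=5.0, fast_release_ms=40.0, slow_release_ms=2000.0,
                   snr_threshold_db=0.0):
    """Smooth per-frame channel levels (dB) with attack/release time constants.

    mode is "fast", "slow", or "snr_aware". In the SNR-aware mode, the release
    time is selected per frame from the a priori SNR of that T-F unit.
    """
    a_attack = smoothing_coeff(attack_ms)
    a_fast = smoothing_coeff(fast_release_ms)
    a_slow = smoothing_coeff(slow_release_ms)
    state = level_db[0]
    smoothed = []
    for n, x in enumerate(level_db):
        if x > state:                       # rising level: attack
            a = a_attack
        elif mode == "fast":                # falling level: fixed short release
            a = a_fast
        elif mode == "slow":                # falling level: fixed long release
            a = a_slow
        else:                               # falling level: SNR-aware switching
            a = a_fast if snr_db[n] > snr_threshold_db else a_slow
        state = a * state + (1.0 - a) * x   # first-order IIR update
        smoothed.append(state)
    return smoothed
```

With the a priori SNR above 0 dB in every frame, the sketch reduces to the fast-acting estimator; with the a priori SNR below 0 dB in every frame, it reduces to the slow-acting one.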
Configuration of the Three Tested Compression Schemes.
Note. SNR = signal-to-noise ratio.
The compression thresholds (CTs) in each frequency channel were calibrated using a stationary noise with an SPL of 50 dB and a spectrum matched to the long-term average spectrum of the Danish hearing-in-noise test (HINT) corpus. Linear (level-independent) gain was applied below the CT. The linear gain and compression ratios (CRs) were calculated from the insertion gain for 50 and 80 dB SPL prescribed by the National Acoustic Laboratories Non-Linear 2 (NAL-NL2; Keidser et al., 2011) rationale. In the fitting software, the settings unilateral and slow were selected. The former setting was chosen to take the monaural presentation of the stimuli into account. The latter setting was chosen because the NAL-NL2 rationale provides higher nominal CRs for slow-acting compression (Keidser et al., 2011), which should further increase the acoustic differences between the processing conditions. To reduce the inter-listener variability of the compression parameters, the CRs were fitted on a group level. The CRs for each group of listeners were based on the fitting to the respective standard audiograms (i.e., N2, N3, or N4). Table 2 shows the CTs and the CRs for individual frequency channels. The linear gain, on the other hand, was fitted individually for each listener for the sake of audibility of the stimulus portions that fall below the CT.
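As an illustration of how a CR and a static gain curve can be obtained from two prescribed insertion gains, the following sketch uses the common definition of the CR as the change in input level divided by the change in output level. The exact computation used in the study is not stated here, so the formula, the function names, and the numerical gains in the usage note are assumptions rather than the fitted values.

```python
def compression_ratio(gain_50_db, gain_80_db):
    """CR implied by the insertion gains at 50 and 80 dB SPL input (assumed definition)."""
    input_span = 80.0 - 50.0
    output_span = (80.0 + gain_80_db) - (50.0 + gain_50_db)
    return input_span / output_span


def static_gain(level_db, ct_db, cr, linear_gain_db):
    """Gain (dB) for an estimated channel level: linear below the CT, compressive above."""
    if level_db <= ct_db:
        return linear_gain_db
    return linear_gain_db - (level_db - ct_db) * (1.0 - 1.0 / cr)
```

For hypothetical insertion gains of 20 dB at 50 dB SPL and 5 dB at 80 dB SPL, the output level rises by 15 dB for a 30-dB rise in input level, which corresponds to a CR of 2.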
Compression Thresholds (CTs) in dB and Compression Ratios (CRs) for Individual Channel Center Frequencies.
Stimuli and Procedure
Noisy speech sampled at a rate of 20 kHz was created by mixing clean speech from the Danish HINT corpus (Nielsen & Dau, 2011) with the following two noise types: the stationary International Collegium of Rehabilitative Audiology (ICRA)-1 noise (Dreschler et al., 2001) and the factory noise from the NOISEX database (Varga & Steeneken, 1993). The factory noise was a recording from an industrial production plant, consisting of various acoustic events, including machine and conveyor belt sounds, with a moderate degree of reverberation. It therefore contained natural spectrotemporal fluctuations, in contrast to the stationary background (which only contained intrinsic temporal fluctuations). The two noise types were chosen in order to investigate potential perceptual effects of spectrotemporal interactions between speech and the background. Both were spectrally matched to the long-term average spectrum of the HINT corpus measured in one-third octave bands. For each noisy speech mixture, a random noise segment was selected. A noise-only segment of 1 s duration was included before and after each sentence.
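The construction of a noisy speech mixture described above can be sketched as follows. The sketch assumes that the noise level is kept fixed and the speech is scaled to reach the desired long-term SNR, and that the SNR is defined from the RMS values of the whole speech and noise segments; neither detail is specified in the text. The function names are hypothetical, and plain Python lists are used only to keep the example free of dependencies. The scaled speech and the noise segment are returned separately, as would be needed to compute an a priori SNR.

```python
import math
import random


def rms(x):
    """Root-mean-square value of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_at_snr(speech, noise, snr_db, fs=20000, pad_s=1.0, rng=random):
    """Mix speech with a randomly selected noise segment at a long-term SNR.

    A noise-only portion of pad_s seconds precedes and follows the sentence.
    Returns (mixture, scaled_speech, noise_segment).
    """
    pad = int(round(pad_s * fs))
    total = len(speech) + 2 * pad
    if len(noise) < total:
        raise ValueError("noise recording is too short")
    start = rng.randrange(len(noise) - total + 1)   # random noise segment
    segment = noise[start:start + total]
    # Assumption: the speech is scaled, while the noise level stays fixed.
    gain = rms(segment) / rms(speech) * 10.0 ** (snr_db / 20.0)
    scaled_speech = [gain * s for s in speech]
    mixture = list(segment)
    for i, s in enumerate(scaled_speech):
        mixture[pad + i] += s
    return mixture, scaled_speech, segment
```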
The administration of the tests and the preprocessing of stimuli were performed using a personal computer running M
SRT Determination
The experimental session began with measuring the SRT in each noise type using conventional fast-acting compression. Scoring was performed on a sentence basis, that is, a correct recall of all five words was required to mark the presented sentence as correct. Each list consisted of 20 sentences. Following the listener’s response to each sentence, the SNR for the next sentence was determined and stored (also following the last sentence on the list, yielding 21 stored SNRs, Nielsen & Dau, 2011). The start SNR was +5 dB. If the first sentence was not correctly identified, it was repeated with an increasing SNR until recalled correctly. The initial step size was 4 dB and was reduced to 2 dB after the first five sentences (Nielsen & Dau, 2011). The SRT was determined as the average of the SNRs from sentence 6 to 21. For each noise type, a training trial was conducted using the HINT training lists. Subsequently, two estimates of the SRT were made (test trials) using a HINT test list selected at random (without replacement). The final SRT value for each noise type was determined as the mean of the values obtained using the two test lists. The starting noise type was selected at random and the noise types were subsequently alternated.
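The adaptive track described above can be sketched as a simple one-up, one-down procedure. The sketch rests on several assumptions that the text does not settle: the SNR is lowered after a correct response and raised after an incorrect one, the 4-dB step applies to the updates following sentences 1 to 5, and the SNR at which the first sentence was eventually recalled is stored as the first of the 21 values. The callback `respond` is a hypothetical stand-in for presenting a sentence and scoring the listener's response.

```python
def run_srt_track(respond, n_sentences=20, start_snr_db=5.0):
    """Run one adaptive list and return (srt_db, stored_snrs).

    respond(sentence_index, snr_db) must return True if all words of the
    sentence were recalled correctly (sentence-based scoring).
    """
    snr = start_snr_db
    # The first sentence is repeated at increasing SNR until recalled correctly.
    while not respond(0, snr):
        snr += 4.0
    snrs = [snr]
    snr -= 4.0                                  # correct response: lower the SNR
    for i in range(1, n_sentences):
        snrs.append(snr)                        # SNR of sentence i + 1
        step = 4.0 if i < 5 else 2.0            # assumed: 2-dB steps from sentence 6
        snr += -step if respond(i, snr) else step
    snrs.append(snr)                            # SNR a 21st sentence would have had
    srt = sum(snrs[5:21]) / 16.0                # mean over sentences 6 to 21
    return srt, snrs
```

For an idealized listener who recalls a sentence correctly whenever the SNR is at or above 0 dB, the track settles into an alternation between +1 and -1 dB, and the sketch returns an SRT of 0 dB.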
Fixed-SNR Sentence-Recognition Scores
A sentence-recognition score was determined for each of the six conditions (2 Noise Types × 3 Processing Strategies). The SNR was fixed for each noise type and equal to the corresponding SRT, determined in the first part of the experiment. The order of the conditions was randomized for each listener. However, each test list was immediately preceded by a training list in the corresponding condition, in order to familiarize the listeners with the given combination of noise and processing type over a broad range of SNRs. The six HINT test lists remaining after SRT determination were selected at random (without replacement). The training lists were used with replacement, such that some of the training lists were experienced by the listeners multiple times in different conditions throughout the entire experiment.
Paired-Comparison Preference Test
For each of the two noise types, comparisons between all three processing types were made (six comparisons in total). Each listener completed 3 trials, for a total of 18 comparisons (except for 1 participant who completed only 2 trials or 12 comparisons).
Before each trial, three sentences from the HINT corpus were selected at random and concatenated to create a running speech sample. The sample was mixed with the background noise at the same SNR as used in the preceding sentence-recognition test. In each presentation, the speech-in-noise sample was processed with each of the processing strategies and presented to the listeners as interval A and interval B in a two-alternative forced-choice manner. The question displayed on the screen was “Which interval do you prefer?”. The order of the processing conditions and noise types was randomized within a trial. The assignment of processing conditions to A or B was also randomized within each presentation. The listeners were blinded to the processing conditions being compared. They had to listen to the entire length of A and B prior to indicating their preference. They could also repeat each of the intervals separately as many times as needed. The listeners were instructed to base their decisions on subjective judgments of overall sound quality and to pay attention to such attributes as quality of the speech, subjective intelligibility, characteristics of the noise or listening comfort, but not to focus on one single attribute in particular. If there was no perceived difference between the intervals, the listeners were instructed to pick an interval at random.
Results
Speech Reception Thresholds
The individual SRTs are shown in Figure 2 for each noise type as a function of the hearing profile (N2, N3, and N4). A two-way fixed-factor analysis of variance (ANOVA) was conducted on the SRT data, with factors noise type and hearing profile. It has to be noted that the results of the analysis should be interpreted carefully, as the N4 group included a smaller number of participants than the other two groups (three vs. seven). On the group level, the results indicated a significant main effect of hearing profile (F = 49.71, df = 2, p < .001) and no effect of noise type, nor any significant interaction between noise type and hearing profile. The SRT averaged across noise types and all listeners within a hearing profile was 0.26, 4.48, and 13.20 dB for the N2, N3, and N4 groups, respectively. These mean SRT values are indicated by the black circles in Figure 2.

Individual SRTs of All Listeners for ICRA-1 and Factory Noise as a Function of the Hearing Profile (N2, N3, and N4). The mean SRTs averaged across noise types and all listeners within a hearing profile are shown by the black circles.
Sentence Scores
A rationalized arcsine unit (RAU) transform (Studebaker, 1985) was applied to the sentence-recognition scores expressed in percent correct. The RAU-transformed scores were averaged across listeners and are shown in Figure 3 as a function of the processing type (fast, slow, and SNR-aware compression) for the ICRA-1 noise (left panel) and the factory noise (right panel). Subsequently, a three-way, mixed-effects ANOVA was conducted on the transformed data. The factors were noise type (with two levels: ICRA-1 and factory), processing type (with three levels: fast, slow, and SNR-aware), and listener. Noise type and processing type were treated as fixed factors, whereas listener was included as a random factor to account for the variability in the degree of hearing loss, differences in audibility, sensitivity to distortion, the operating SNR, and so on (Naylor, 2016). In addition, all possible first-order interactions were included.
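For reference, the RAU transform of Studebaker (1985) can be written as a short function operating on the number of correctly recalled sentences. The function name is illustrative, and the default of 20 items assumes one list of 20 sentences per condition, as described in the Methods.

```python
import math


def rau(n_correct, n_items=20):
    """Rationalized arcsine units for n_correct out of n_items (Studebaker, 1985)."""
    theta = (math.asin(math.sqrt(n_correct / (n_items + 1.0)))
             + math.asin(math.sqrt((n_correct + 1.0) / (n_items + 1.0))))
    return 146.0 / math.pi * theta - 23.0
```

A score of 10 out of 20 maps to 50 RAU. Scores near 0% or 100% map to values slightly below 0 or above 100, which is the intended stretching of the ends of the scale.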

RAU-Transformed Sentence Recognition Scores Averaged Across Listeners as a Function of the Processing Type (Fast, Slow, and SNR-Aware Dynamic Range Compression) for ICRA-1 Noise (Left Panel) and Factory Noise (Right Panel). The error bars indicate the standard errors of the mean. RAU = rationalized arcsine units; SNR = signal-to-noise ratio; ICRA-1 = International Collegium of Rehabilitative Audiology.
The ANOVA revealed a large and significant main effect of processing type (F = 4.07, df = 2, p = .0266, partial η² = 0.21; Cohen, 1973). Moreover, a significant interaction between the noise type and listener was found (F = 2.84, df = 16, p = .0059, partial η² = 0.59). The interaction between the factors noise type and processing type did not reach statistical significance (F = 2.61, df = 2, p = .089). Therefore, in the post hoc analysis, the results were pooled across both noise types. For each processing type, the RAU-transformed scores were averaged across listeners, as shown in Figure 4. To compare the means, 95% confidence intervals were constructed based on the mean squared error from the ANOVA, and their lengths were adjusted using Bonferroni corrections to account for multiple comparisons. The post hoc analysis revealed no statistically significant differences between the fast and the SNR-aware system, nor between the fast and the slow system. The only statistically significant difference was found between the SNR-aware and the slow system (61.4 vs. 53.2 RAU, p < .05).

RAU-Transformed Sentence-Recognition Scores Averaged Across Listeners and Noise Types as a Function of the Processing Type (Fast-, Slow-, and SNR-Aware Dynamic Range Compression). The error bars represent the 95% confidence intervals (see the main text for details). Level of statistical significance of the difference of means is indicated as follows: *p < .05 or ns = nonsignificant. RAU = rationalized arcsine units; SNR = signal-to-noise ratio; ICRA-1 = International Collegium of Rehabilitative Audiology.
Subjective Preference
For each noise type, data from 150 paired-comparison trials were collected (16 Listeners × 9 Trials + 1 Listener × 6 Trials). For each listener, the trials were evaluated for consistency in terms of transitivity, and the trials containing circular triads were rejected (see Kendall, 1962; Kendall & Smith, 1940, for a detailed discussion). Overall, 111 of the 150 trials for the ICRA-1 noise and 120 of the 150 trials for the factory noise were considered for further analysis. For each noise type, the responses from the remaining trials were pooled together to create response matrices. These matrices are summarized in terms of the number of wins for each strategy in the top panels of Figure 5. Subsequently, the values in the response matrices were converted to relative frequency and evaluated for weak stochastic transitivity (Ellermeier et al., 2004). The weak stochastic transitivity was maintained for both noise types, which allowed a more restrictive Bradley–Terry–Luce (BTL) model to be fitted (Bradley & Terry, 1952; Ellermeier et al., 2004; Luce, 1959). The BTL model was evaluated separately for each noise type using the M

Results of the Subjective Preference Test as a Function of the Processing Type (Fast-, Slow-, and SNR-Aware Dynamic Range Compression) for ICRA-1 Noise (Left Panels) and Factory Noise (Right Panels). The panels in the top row show the number of wins based on the consistent trials from all listeners. The panels in the bottom row show the corresponding ratio-scale values derived from the BTL model, including the 95% confidence intervals (see the main text for details). Level of statistical significance is indicated as follows: *p < .05, **p < .01, ***p < .001 or ns = nonsignificant; SNR = signal-to-noise ratio; BTL = Bradley–Terry–Luce; ICRA-1 = International Collegium of Rehabilitative Audiology.
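The consistency check and the BTL scaling used in this section can be illustrated with the following sketch. It is not the analysis code used in the study. It assumes that one complete set of the three pairwise comparisons is treated as the unit that is kept or rejected, and it fits the BTL scale values by the standard iterative maximum-likelihood update rather than by any particular toolbox. All function names and the counts in the usage note are hypothetical, and confidence intervals are not computed.

```python
def is_circular(trial):
    """True if three pairwise choices form a circular triad.

    trial maps each pair (a, b) to the preferred item. With three items,
    the choices are circular exactly when every item wins once.
    """
    wins = {}
    for (a, b), winner in trial.items():
        wins.setdefault(a, 0)
        wins.setdefault(b, 0)
        wins[winner] += 1
    return all(w == 1 for w in wins.values())


def pool_wins(trials, items):
    """wins[i][j] = number of consistent trials in which items[i] beat items[j]."""
    idx = {name: k for k, name in enumerate(items)}
    wins = [[0] * len(items) for _ in items]
    for trial in trials:
        if is_circular(trial):
            continue                            # reject intransitive trials
        for (a, b), winner in trial.items():
            loser = b if winner == a else a
            wins[idx[winner]][idx[loser]] += 1
    return wins


def fit_btl(wins, n_iter=500):
    """Maximum-likelihood BTL scale values, normalized to sum to 1."""
    k = len(wins)
    p = [1.0 / k] * k
    for _ in range(n_iter):
        new_p = []
        for i in range(k):
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(k) if j != i)
            new_p.append(sum(wins[i]) / denom)
        total = sum(new_p)
        p = [v / total for v in new_p]
    return p
```

For a hypothetical win matrix such as `[[0, 8, 9], [2, 0, 6], [1, 4, 0]]`, the fitted scale values are ordered in the same way as the total numbers of wins, and the ratio of two scale values gives the modeled odds of preferring one strategy over the other.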
Discussion
The purpose of this study was to conduct a perceptual evaluation of the novel SNR-aware compression strategy proposed by May et al. (2018) in HI listeners. Three audiometrically profiled groups were tested: N2, N3, and N4. Two noise types were considered: ICRA-1 stationary speech-shaped noise and factory noise from the NOISEX database. The SNR-aware strategy was compared with conventional fast- and slow-acting compression systems. For each noise type, the listeners’ individual SRTs were determined using fast-acting compression. The corresponding SNR values were subsequently used for obtaining sentence-recognition scores at a fixed SNR, as well as preference ratings using a paired-comparison paradigm.
Compression Strategy
The ANOVA of sentence-recognition scores indicated a statistically significant main effect of processing type and no main effect of noise type. Moreover, the interaction between the noise type and the processing type did not reach statistical significance. However, the following trend was observed in the RAU-transformed sentence-recognition scores shown in Figure 3. In the ICRA-1 noise, it appears that there are almost no differences between the (averaged) scores. While a small advantage of fast- versus slow-acting compression was found in the factory noise condition, a larger advantage over either of the two conventional schemes was obtained with the SNR-aware processing scheme. Because the interaction was not statistically significant, the subsequent post hoc tests had to be conducted on scores pooled across noise types. Nevertheless, it appears that the pattern observed in the analysis might be blurred by the outcomes obtained with the ICRA-1 noise. The post hoc tests revealed an advantage of the SNR-aware strategy over conventional slow-acting compression and no difference between the SNR-aware and the conventional fast-acting processing.
Compared with slow-acting compression, fast-acting compression of speech provides ECRs that are closer to the nominal CR prescribed by the gain rationale, resulting in improved audibility. The results of this study suggest that these acoustic effects are necessary (but not sufficient) for an improved speech recognition in noise. If conventional processing is applied, those positive effects are likely offset by a distortion of the noise modulation spectrum and a reduction of the long-term broadband SNR. To take full advantage of fast-acting compression, a differentiation between the target and the background is required, followed by applying distinct processing to the two signal components (foreground vs. background). This is achieved by the SNR-aware compression strategy and seems to provide a more favorable balance between audibility and ECR improvement versus MTF- and SNR-distortion. Moreover, as mentioned earlier, the advantage of the SNR-aware strategy seems to be more pronounced in the factory noise condition. This could stem from the stronger interaction between the speech and the background noise due to natural envelope fluctuations occurring in the two signals. The SNR-aware compression strategy reduces this interaction, which could be advantageous for speech recognition. However, this interpretation has to be treated with caution due to the weak statistical evidence supporting it.
The subjective preference scores indicated an advantage of the novel SNR-aware compression strategy over both conventional fast- and slow-acting processing for both noise types. In addition, an advantage of slow- over fast-acting compression was observed in the stationary ICRA-1 noise but not in the nonstationary factory noise. This suggests that the cyclical amplification has a more prominent negative effect on the perceived quality in stationary backgrounds. This is consistent with the conclusion drawn by Neuman et al. (1995), that the cyclical pumping becomes more noticeable as more noise is present in the speech gaps. Informally, some of the participants in this study reported that most of the perceived differences between the compared strategies were in the characteristics of the background noise. The additional advantage of SNR-aware over slow-acting compression likely stems from the increased ECR and improved audibility, which are potentially linked to improved speech intelligibility. It is likely that the listeners’ ability to comprehend the processed speech material was an important factor that contributed to the overall preference judgment. This is consistent with the studies by Preminger and Van Tasell (1995) and Hansen (2002). Preminger and Van Tasell investigated the effects of different frequency shaping on normal-hearing listeners’ ratings in terms of several attributes of subjective sound quality such as intelligibility, pleasantness, listening effort, loudness, and overall impression. They found that ratings across the other dimensions were correlated with the ratings of intelligibility. Hansen tested HI listeners’ preference in terms of several attributes of sound, including subjective intelligibility using WDRC-processed stimuli with various combinations of time constants and CTs. The conditions yielding the highest overall preference also corresponded to the highest preference in terms of subjective intelligibility.
Listener-Specific Factors
As expected, the SRT depended on the degree of hearing loss and was highest (i.e., worst) in the N4 group, as shown in Figure 2 and indicated by the ANOVA. The N4 listeners were hence tested at the highest SNRs in the subsequent parts of the experiment. Therefore, they experienced greater acoustic differences between the processing strategies (May et al., 2018). This phenomenon was described by Naylor (2016) as selection-treatment interaction, that is, a situation in which the selection of the participants (their hearing profiles and therefore the SRTs) influences the magnitude of the differences across treatments (processing strategies). It was the main reason to include listener as a random factor in the statistical analysis of sentence recognition scores. As the listeners were tested in the vicinity of the steepest point on the psychometric function, large acoustic differences were, in turn, expected to create large perceptual differences. The beneficial effects of SNR-aware compression might be even larger if more listeners were included in the N4 group. As mentioned earlier, for those listeners, the operational point is shifted toward higher SNRs relative to the N2 and N3 groups, and hence the acoustical differences between the strategies are greater. It is even possible that at such high SNRs, the differences in perception are driven mostly by the changes in the acoustics of speech, that is, the high ECR of speech achieved by conventional fast-acting and SNR-aware processing compared with slow-acting compression (see May et al., 2018; Figure 3), and not by the interaction of speech and noise. If this were the case, the speech-intelligibility benefit of fast- over slow-acting compression would increase with increasing SRT. However, a regression analysis did not indicate any significant correlation between the two outcomes. Moreover, this prediction is based on the assumption that applying fast-acting compression to the target is always desirable.
It is possible that, due to greater suprathreshold auditory processing deficits, more severely impaired listeners rely more strongly on the temporal-envelope cues present in the speech signal itself, a notion supported by the studies of Souza et al. (2005), Souza et al. (2012b), and Souza et al. (2015b). In that case, any form of fast-acting compression could be detrimental to speech recognition by those listeners, negatively affecting their perception despite seemingly positive acoustical effects. The regression analysis revealed that neither the pure-tone average nor age could predict the differences in performance between processing types. Some form of psychoacoustic metric of sensitivity to temporal-envelope distortion could potentially identify the listeners who are likely to be negatively affected by fast-acting processing. However, to date no such test exists, especially when the practical constraints of a clinical environment are taken into account. Some evidence suggests that HI individuals with high working-memory capacity are better able to take advantage of fast-acting processing of the speech signal (see Souza et al., 2015a, for a review). It is possible that, in this study, such participants took greater advantage of the differential processing of the target and the background noise. A measure of working-memory capacity was not included in this study design. Nevertheless, considering this factor in future investigations could help to establish whether cognitively high-performing listeners indeed benefit more from SNR-aware compression strategies and hence allow for a more individualized fitting.
Limitations
The paired comparisons were conducted using noisy speech at a relatively low SNR, corresponding to the SRT. This made it possible to measure both intelligibility and subjective preference in the same acoustic conditions. However, such conditions are not optimal for evaluating the overall sound quality, because listeners may not be able to focus on a broader range of attributes due to the low intelligibility. The listeners’ preference might, in fact, be determined solely by the differences in intelligibility between the processing types. A potential solution would be to adjust the SNR individually for each processing type, that is, to measure the SRT for all processing types instead of measuring it only for the fast-acting compression, yielding an iso-intelligibility rather than an iso-SNR comparison. One could also conduct the paired comparisons at a higher SNR or even at a range of SNRs, revealing any potential effects of the SNR on the subjective preference. Moreover, apart from the overall preference, an explicit evaluation in terms of specific attributes such as subjective intelligibility, noisiness, or clarity could be employed, as was done in the studies of Neuman et al. (1998) and Hansen (2002).
The frequency response of the headphone was equalized to have a flat response with reference to the ear-canal entrance, as described in the Stimuli and procedure subsection. As a consequence, the acoustic gain due to the pinna and the concha was not included in the presentation. This reduced the unaided response by 5 to 10 dB in the 2 to 4 kHz range, leading to a systematic mismatch between the aided response and the NAL-NL2 target. In any case, an exact match to the NAL-NL2 target for each listener was only possible at relatively low SPLs. This is because the level- and frequency-dependent gain values were based on the individual targets only for input SPLs up to 50 dB. At higher input SPLs, the gain was based on CRs that were fitted on a group level (N2, N3, and N4) and not on the individual prescription. Moreover, this mismatch is mostly within the fit-to-target tolerances of ±5 dB for frequencies up to 2 kHz and ±8 dB above 2 kHz, as recommended by Gatehouse et al. (2001) and widely used in clinical settings during real-ear verification. While this effect might have affected audibility and spectral shaping (potentially relevant for sound quality), it was present across all compression settings and was included in the SRT determination. Therefore, it is unlikely to have affected the study outcomes.
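The tolerance rule quoted above (±5 dB up to 2 kHz, ±8 dB above 2 kHz) can be written as a simple check. The function name and the example values below are hypothetical and do not correspond to measurements from this study.

```python
def within_fit_tolerance(freq_hz, aided_db, target_db):
    """Return True if the aided response is within the fit-to-target tolerance.

    Tolerance: +/-5 dB for frequencies up to 2 kHz, +/-8 dB above 2 kHz.
    """
    tolerance_db = 5.0 if freq_hz <= 2000 else 8.0
    return abs(aided_db - target_db) <= tolerance_db


# Hypothetical example: a 7-dB shortfall is outside tolerance at 1 kHz
# but inside tolerance at 3 kHz.
print(within_fit_tolerance(1000, aided_db=58.0, target_db=65.0))
print(within_fit_tolerance(3000, aided_db=58.0, target_db=65.0))
```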
Finally, the results presented in this study evaluated the ideal SNR-aware compression strategy based on the a priori SNR. To apply this strategy in the context of hearing aids, the ideal speech detector needs to be replaced by an estimator that only has access to the noisy speech signal. The comparison in May et al. (2018) showed that a set of instrumental metrics was very similar for the SNR-aware system using either the estimated or the a priori SNR, indicating that a similar performance may be expected in the perceptual tasks. However, future work should evaluate the influence of SNR estimation errors on perception via behavioral listening tests.
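As an illustration of what access to the a priori SNR means, the sketch below computes the SNR per T-F unit from separately available speech and noise powers and thresholds it to label speech-dominated units. The function names, the 0-dB threshold, and the regularization constant are assumptions for illustration and are not taken from May et al. (2018). A practical system would have to estimate these quantities from the noisy mixture alone.

```python
import math


def a_priori_snr_db(speech_power, noise_power, eps=1e-12):
    """SNR in dB of one T-F unit, given the separate speech and noise powers."""
    return 10.0 * math.log10((speech_power + eps) / (noise_power + eps))


def ideal_speech_mask(speech_tf, noise_tf, threshold_db=0.0):
    """Label each T-F unit as speech-dominated (True) or noise-dominated (False).

    speech_tf, noise_tf -- nested lists of power values (channels x frames)
    threshold_db        -- hypothetical decision threshold
    """
    return [
        [a_priori_snr_db(s, n) > threshold_db for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_tf, noise_tf)
    ]


# One channel, two frames: +10 dB SNR followed by -10 dB SNR.
mask = ideal_speech_mask([[1.0, 0.01]], [[0.1, 0.1]])
```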
Applicability to Real-World Scenarios
This study focused on the perceptual benefit of SNR-aware compression when processing noisy speech and did not take the effect of the overall SPL of the speech and noise components into account. The conditions were chosen to emphasize the influence of audibility on the outcome metrics; hence, a relatively low input noise SPL of 50 dB was selected. As a result, in many cases, the speech level was below normal conversational levels. It is possible that the balance between different cues provided by slow- and fast-acting compression would change at higher noise levels, which occur quite frequently in real-world scenarios (Smeds et al., 2015; Weisser & Buchholz, 2019). It would therefore be interesting to investigate a condition with a notably higher background noise SPL (e.g., 65 or 70 dB).
Another factor that is present in many real-world acoustic scenarios, but not considered here, is reverberation. To take advantage of fast-acting compression of the speech signal in even more realistic scenarios where both room reverberation and interfering noise are present simultaneously, it is necessary to update the speech detection stage (e.g., with the power spectral density estimator proposed by Kuklasiński et al., 2016). When dealing with multiple competing sound sources that are spatially separated, the detection of speech-dominated T-F units could alternatively be accomplished by the analysis of spatial cues (May et al., 2011).
Conclusion
A perceptual evaluation of the SNR-aware compression strategy proposed by May et al. (2018) was conducted in controlled laboratory conditions in a group of HI listeners. The strategy was shown to provide a speech intelligibility benefit in noise compared with conventional slow-acting compression and achieved a higher subjective preference compared with both conventional fast- and slow-acting compression schemes. Future research will characterize the listeners who benefit the most from this new compression scheme and will determine its applicability to a broader range of acoustic conditions.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Technical University of Denmark and Centre for Applied Hearing Research.
