Abstract
Dynamic range compression is a compensation strategy commonly used in modern hearing aids. Fast-acting systems respond relatively quickly to the fluctuations in the input level. This allows for more effective compression of the dynamic range of speech and hence enhances the audibility of its low-intensity components. However, such processing also amplifies the background noise, distorts the modulation spectra of both the speech and the background, and can reduce the output signal-to-noise ratio (SNR). Recently, May et al. proposed a novel SNR-aware compression strategy, in which the compression speed is adapted depending on whether speech is present or absent. Fast-acting compression is applied to speech-dominated time–frequency (T-F) units, while noise-dominated T-F units are processed using slow-acting compression. It has been shown that this strategy provides a similar effective compression of the speech dynamic range as conventional fast-acting compression, while introducing fewer distortions of the modulation spectrum of the background and providing an improved output SNR. In this study, this SNR-aware compression strategy was compared with conventional fast- and slow-acting compression in terms of speech intelligibility and subjective preference in a group of 17 hearing-impaired listeners with varying degrees of hearing loss. The results show a speech intelligibility benefit of the SNR-aware compression strategy over the conventional slow-acting system. Furthermore, the SNR-aware approach demonstrates an increased subjective preference compared with both conventional fast- and slow-acting systems.
Introduction
Sensorineural hearing loss is associated with a decreased sensitivity to low-intensity sounds as well as a range of suprathreshold auditory deficits. These deficits include, among others, the phenomenon of loudness recruitment and the limitation of the dynamic range (e.g., Bacon & Oxenham, 2004; Smeds & Leijon, 2011). To account for this, modern hearing aids typically implement some form of level-dependent amplification such as wide dynamic range compression (WDRC, see Souza, 2002, for a review). Such systems provide relatively high gain for low-intensity input sounds to ensure sufficient audibility, which appears to be necessary for good speech recognition (Pavlovic & Studebaker, 1984; Souza & Turner, 1999; Woods et al., 2013). As the input level increases, the gain is reduced to avoid loudness discomfort. To follow the temporal dynamics of speech, a compression system should respond rapidly to changes in the input level across time (Edwards, 2004; Moore, 2008; Souza, 2002). This requires the use of short time constants in the level estimation stage of the signal-processing chain (for implementation details, see Giannoulis et al., 2012; Kates, 1993). However, the application of short time constants can also lead to rapid fluctuations in the gain function over time, introducing potentially detrimental distortions of the temporal envelope of speech (e.g., Gatehouse et al., 2006; Jenstad & Souza, 2005, 2007; Plomp, 1988; Souza et al., 2012a; Walaszek, 2008). A number of studies have shown that fast-acting WDRC provides an improvement in audibility of speech sufficient to offset the potentially detrimental distortion of the temporal envelope of the signal, leading to a net intelligibility benefit. This was demonstrated for speech in quiet by Villchur (1973), Souza and Turner (1998, 1999), Souza and Bishop (1999), and Davies-Venn et al. (2009). 
An acoustic analysis conducted by Alexander and Rallapalli (2017) showed that fast-acting compression leads to a higher effective compression ratio (ECR, based on short-term level histograms) compared with slow-acting compression. This has a positive effect on speech audibility but, on the other hand, negatively affects the speech modulation transfer function (MTF). The speech-recognition results reported in the same study suggest that, in many cases, the audibility benefit counteracts the negative effects of envelope-domain distortion.
When the target speech is degraded by background noise, the benefit of WDRC appears to depend on a variety of factors such as the spectrotemporal characteristics of the noise, the overall input level, and the signal-to-noise ratio (SNR), as demonstrated, for example, by Yund and Buckles (1995). Souza et al. (2006) demonstrated that the presence of background noise decreases the overall amount of envelope fluctuations, leading to less dynamic changes in the gain function and, as a result, a decreased ECR of speech. Rhebergen et al. (2009) reported beneficial effects of compression on the speech reception threshold (SRT) when the processing was applied to the speech alone prior to mixing it with the background noise. However, such conditions are rather artificial. Rhebergen et al. also considered a more realistic scenario, in which the processing was instead applied to the mixture of speech and either a stationary or a nonstationary, interrupted noise. In that case, compression had a pronounced beneficial effect on the SRT in the interrupted noise. Similar findings were reported in a later study by Rhebergen et al. (2017). At negative SNRs (as was the case in both studies of Rhebergen et al.), the interferer is the more dominant stimulus and its temporal fluctuations drive the compression system. The gain is increased during the dips in the noise, amplifying the low-level glimpses of speech present in those dips. The results of Desloge et al. (2017) and Kowalewski et al. (2018) further support the notion that fast-acting compression systems provide improved short-term audibility and increased opportunities for glimpsing, as long as the noise exhibits prominent fluctuations and the long-term input SNR is negative.
In contrast, in scenarios characterized by high long-term input SNRs, the compression is driven mostly by the changes of the speech level. The fast changes in gain cyclically amplify the background, introducing modulation components to the noise (Stone & Moore, 2003, 2004, 2007, 2008) and reducing the long-term output SNR (Hagerman & Olofsson, 2004; May et al., 2018; Naylor & Johannesson, 2009; Rhebergen et al., 2009, 2017; Souza et al., 2006). Both effects are potentially detrimental to speech intelligibility and the perceived sound quality. Taken together, the previous findings indicate that fast-acting compression has largely positive effects on speech intelligibility due to increased audibility and a reduced dynamic range in the following scenarios: (a) speech in quiet, (b) speech in the presence of a strongly fluctuating noise at a negative SNR, and (c) speech compressed prior to mixing it with noise (unrealistic). These benefits are largely reduced, or turn into a detriment, as soon as the input SNR becomes positive (which is a common scenario, see Smeds et al., 2015; Weisser & Buchholz, 2019) and/or when the interferer is stationary. It is nevertheless possible that the advantages of fast-acting compression would be restored if a selective processing of the speech and the noise components could be achieved.
Several studies have focused on the effects of compression release time on listeners’ subjective preference and/or perceived quality. Their conclusions are largely in line with the aforementioned studies on speech intelligibility. Neuman et al. (1995) investigated hearing-impaired (HI) listeners’ overall preference for the compression release time (60, 200, and 1000 ms) when processing speech in the presence of background noise of varying characteristics and levels. Overall, longer release times were preferred for the types of noise naturally characterized by higher sound pressure levels (SPLs). In a follow-up study using the same set of conditions (Neuman et al., 1998), the listeners rated several attributes of sound quality. The results indicated that, with longer compression release times, the ratings of the overall impression, pleasantness, and clarity increased, while the rating of noisiness decreased. This was likely due to the above-mentioned cyclical amplification of the background noise that occurs at positive input SNRs. The effect becomes more prominent with shorter release times (as more gain is provided to the noise during the speech gaps) and is more noticeable as the level of the background increases. A similar preference for longer release times was demonstrated by Hansen (2002) in a group of HI listeners and a range of acoustic scenarios. Neuman et al. (1995) suggested the use of an adaptive release time in hearing aids in order to improve the perceived sound quality. A shorter release time could be used in quieter scenarios, while a longer release time could be applied with increasing levels of background noise. Several adaptive compression strategies have been proposed in the past, including the K-AMP (Killion et al., 1992), the dual front-end adaptive gain control (Moore & Glasberg, 1988), the guided level estimator (Neumann, 2008), and the short-term dynamic-range-driven system proposed by Lai et al. (2013).
However, all of these systems rely on short-term level dynamics of the speech and noise mixture and do not explicitly utilize information related to the presence of the target signal with respect to the background noise.
The SNR-aware dynamic range compression strategy presented by May et al. (2018) attempts to combine the advantages of both fast- and slow-acting compression. The main idea is to adjust the release time of the compressor in each individual time–frequency (T-F) unit depending on whether the target is present or absent. Specifically, a short release time is applied to speech-dominated T-F units where the short-term SNR is high, while a longer release time is used to process T-F units that are dominated by noise. The SNR-aware compression strategy bears some similarities with the aforementioned artificially created scenario tested by Rhebergen et al. (2009), where the speech alone was compressed prior to mixing it with noise. The difference is that the SNR-aware approach operates on the noisy speech mixture and does not require the availability of separate speech and noise signals, making it potentially applicable in hearing devices. Similar principles had previously been applied in the compression system driven by the direct-to-reverberant energy ratio, which was shown to preserve the listeners’ spatial perception (Hassager et al., 2017). May et al. (2018) provided an instrumental evaluation of the SNR-aware compression strategy and compared it with conventional fast- and slow-acting compression. The SNR-aware compression strategy provided ECRs similar to those obtained with conventional fast-acting compression, while the natural fluctuations in the background noise were preserved in a similar way as when slow-acting compression was applied.
In this study, the SNR-aware dynamic range compression strategy was evaluated in terms of speech intelligibility and subjective preference in a group of HI listeners. It was hypothesized that the SNR-aware compression strategy would provide superior audibility compared with slow-acting compression, while it would result in a higher output SNR and introduce fewer distortions of the background compared with fast-acting compression, leading to superior speech intelligibility performance and higher preference scores. To exclude the potential effects of SNR estimation errors on perception, the ideal SNR-aware strategy based on the
Methods
Participants
The study included 17 HI listeners aged 25 to 80 years (average 68.7 years). All participants underwent screening conducted by a trained audiologist, which included tympanometry, pure-tone audiometry (air and bone conduction), and word recognition scores in quiet (discrimination scores) using the Dantale corpus (Elberling et al., 1989). Based on this evaluation, all listeners’ hearing loss was classified as sensorineural. The listeners’ audiograms were compared with the standard audiograms proposed by Bisgaard et al. (2010) and were further classified into three groups based on the smallest absolute distance criterion (in dB): seven listeners in the

Pure-Tone Audiograms of Listeners in the
Signal Processing and Fitting
All dynamic range compression systems were based on the short-time discrete Fourier transform using frames of 10 ms duration with 75% overlap and operated in seven independent octave-wide frequency channels with center frequencies ranging from 125 to 8000 Hz. The level estimation in each frequency channel was performed using a first-order infinite impulse response filter with different time constants associated with the attack and the release (Kates, 1993). As shown in Table 1, the following three compression systems were tested: conventional fast- and slow-acting compression as well as SNR-aware compression. The attack time in the level estimator was always set to 5 ms. The fast-acting system utilized a level estimator with a short release time of 40 ms, while it was set to 2000 ms for the slow-acting system. The level estimator in the SNR-aware system switched between the short and the long release time in individual T-F units by applying a threshold criterion of 0 dB to the
Configuration of the Three Tested Compression Schemes.
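The level estimation described above can be illustrated with a minimal Python sketch for a single frequency channel. It rests on several assumptions that are not taken from the study: a one-pole smoother with coefficient exp(-1/(τ·f)), a frame rate f of 400 Hz (10-ms frames with 75% overlap imply a 2.5-ms hop), and a per-unit SNR value in dB supplied by the caller. All function and variable names are hypothetical, and the mapping from nominal attack and release times to filter coefficients may differ from the one used in the actual systems.

```python
import math

FRAME_RATE_HZ = 400.0     # 10 ms frames, 75% overlap -> 2.5 ms hop
ATTACK_S = 0.005          # attack time used by all three systems
FAST_RELEASE_S = 0.040    # release time of the fast-acting system
SLOW_RELEASE_S = 2.0      # release time of the slow-acting system
SNR_THRESHOLD_DB = 0.0    # switching criterion of the SNR-aware system


def smoothing_coeff(time_constant_s, rate_hz=FRAME_RATE_HZ):
    """One-pole coefficient for a given time constant (assumed mapping)."""
    return math.exp(-1.0 / (time_constant_s * rate_hz))


def estimate_level(power, snr_db=None, mode="snr_aware"):
    """Smooth the short-term power of one channel, frame by frame.

    power  : sequence of linear power values, one per frame
    snr_db : sequence of per-frame SNR values (only needed for "snr_aware")
    mode   : "fast", "slow", or "snr_aware"
    """
    a_attack = smoothing_coeff(ATTACK_S)
    a_fast = smoothing_coeff(FAST_RELEASE_S)
    a_slow = smoothing_coeff(SLOW_RELEASE_S)

    estimate = power[0]
    smoothed = []
    for n, p in enumerate(power):
        if p > estimate:
            a = a_attack                      # rising level: attack
        elif mode == "fast":
            a = a_fast
        elif mode == "slow":
            a = a_slow
        else:
            # Speech-dominated unit -> short release, otherwise long release.
            a = a_fast if snr_db[n] > SNR_THRESHOLD_DB else a_slow
        estimate = a * estimate + (1.0 - a) * p
        smoothed.append(estimate)
    return smoothed
```

In this sketch, the SNR-aware mode reduces to the fast-acting estimator when every unit is speech dominated and to the slow-acting estimator when every unit is noise dominated.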
The compression thresholds (CTs) in each frequency channel were calibrated using a stationary noise with an SPL of 50 dB and a spectrum matched to the long-term average spectrum of the Danish hearing-in-noise test (HINT) corpus. Linear (level-independent) gain was applied below the CT. The linear gain and compression ratios (CRs) were calculated from the insertion gain for 50 and 80 dB SPL prescribed by the National Acoustic Laboratories Non-Linear 2 (NAL-NL2; Keidser et al., 2011) rationale. In the fitting software, the settings
Compression Thresholds (CTs) in dB and Compression Ratios (CRs) for Individual Channel Center Frequencies.
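The relation between the prescribed insertion gains, the CR, and the level-dependent gain can be sketched as follows. This is an illustration under stated assumptions rather than the fitting procedure of the study: the CR is taken as the ratio of the 30-dB input range (50 to 80 dB SPL) to the corresponding output range, the gain above the CT follows a single straight compression line, and the function names and example values are hypothetical.

```python
def compression_ratio(gain_50_db, gain_80_db):
    """CR implied by the insertion gains prescribed at 50 and 80 dB SPL.

    A 30 dB change at the input maps to (30 + gain_80 - gain_50) dB
    at the output; the CR is the ratio of the two ranges.
    """
    output_range_db = (80.0 + gain_80_db) - (50.0 + gain_50_db)
    return 30.0 / output_range_db


def channel_gain_db(level_db, ct_db, linear_gain_db, cr):
    """Gain for an estimated channel level: linear below the CT, compressive above."""
    if level_db <= ct_db:
        return linear_gain_db
    # Above the CT, each dB of input adds only 1/CR dB at the output.
    return linear_gain_db - (level_db - ct_db) * (1.0 - 1.0 / cr)
```

For example, hypothetical gains of 20 dB at 50 dB SPL and 10 dB at 80 dB SPL give a CR of 1.5; with a CT of 50 dB, the gain at an estimated level of 80 dB then returns to the prescribed 10 dB.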
Stimuli and Procedure
Noisy speech sampled at a rate of 20 kHz was created by mixing clean speech from the Danish HINT corpus (Nielsen & Dau, 2011) with the following two noise types: the stationary International Collegium of Rehabilitative Audiology (ICRA)-1 noise (Dreschler et al., 2001) and the factory noise from the NOISEX database (Varga & Steeneken, 1993). The factory noise was a recording from an industrial production plant, consisting of various acoustic events, including machine and conveyor belt sounds, with a moderate degree of reverberation. It therefore contained natural spectrotemporal fluctuations, in contrast to the stationary background (which only contained intrinsic temporal fluctuations). The two noise types were chosen in order to investigate potential perceptual effects of spectrotemporal interactions between speech and the background. Both were spectrally matched to the long-term average spectrum of the HINT corpus measured in one-third octave bands. For each noisy speech mixture, a random noise segment was selected. A noise-only segment of 1 s duration was included before and after each sentence.
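The construction of a noisy speech mixture can be sketched as follows. The sketch assumes that the noise level is kept fixed and the speech is scaled to reach the target SNR, and that the SNR is defined as the ratio of the root-mean-square (RMS) value of the sentence to that of the selected noise segment. Neither detail is specified in the text above, and the function names are hypothetical.

```python
import math
import random


def rms(x):
    """Root-mean-square value of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_at_snr(speech, noise, snr_db, fs, pad_s=1.0, rng=random):
    """Mix a sentence with a randomly selected noise segment.

    The noise segment extends pad_s seconds before and after the sentence.
    Returns the mixture, the scaled speech, and the selected noise segment.
    """
    pad = int(round(pad_s * fs))
    total = len(speech) + 2 * pad
    if len(noise) < total:
        raise ValueError("noise recording is too short")

    # Random noise segment covering the sentence plus the noise-only margins.
    start = rng.randrange(len(noise) - total + 1)
    segment = noise[start:start + total]

    # Scale the speech so that its RMS is snr_db above the noise RMS.
    gain = rms(segment) / rms(speech) * 10.0 ** (snr_db / 20.0)
    scaled = [gain * s for s in speech]

    mixture = list(segment)
    for i, s in enumerate(scaled):
        mixture[pad + i] += s
    return mixture, scaled, segment
```

Passing a seeded `random.Random` instance as `rng` makes the segment selection reproducible.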
The administration of the tests and the preprocessing of stimuli were performed using a personal computer running M
SRT Determination
The experimental session began with measuring the SRT in each noise type using conventional fast-acting compression. Scoring was performed on a sentence basis, that is, a correct recall of all five words was required to mark the presented sentence as correct. Each list consisted of 20 sentences. Following the listener’s response to each sentence, the SNR for the next sentence was determined and stored (also following the last sentence on the list, yielding 21 stored SNRs, Nielsen & Dau, 2011). The start SNR was +5 dB. If the first sentence was not correctly identified, it was repeated with an increasing SNR until recalled correctly. The initial step size was 4 dB and was reduced to 2 dB after the first five sentences (Nielsen & Dau, 2011). The SRT was determined as the average of the SNRs from sentence 6 to 21. For each noise type, a training trial was conducted using the HINT training lists. Subsequently, two estimates of the SRT were made (test trials), each using a HINT test list selected at random (without replacement). The final SRT value for each noise type was determined as the mean of the values obtained using the two test lists. The starting noise type was selected at random and the noise types were subsequently alternated.
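The adaptive track described above can be sketched in a few lines of Python. The sketch assumes a one-up, one-down rule (the SNR is lowered after a correct response and raised after an incorrect one) and applies the 4-dB step to the adjustments following each of the first five sentences; both are readings of the description rather than confirmed details. The `respond` callback, which stands in for the listener, is hypothetical.

```python
def run_srt_track(respond, n_sentences=20, start_snr=5.0,
                  big_step=4.0, small_step=2.0, n_big=5):
    """Run one adaptive SRT track.

    respond(index, snr_db) must return True if the whole sentence
    was recalled correctly at the given SNR.
    Returns the SRT and the list of n_sentences + 1 stored SNRs.
    """
    snr = start_snr
    # The first sentence is repeated at increasing SNR until it is correct.
    while not respond(0, snr):
        snr += big_step
    snrs = [snr]

    correct = True  # the first sentence ended with a correct response
    for k in range(n_sentences):
        if k > 0:
            correct = respond(k, snr)
        step = big_step if k < n_big else small_step
        snr = snr - step if correct else snr + step
        snrs.append(snr)  # SNR for the next sentence (also after the last one)

    # Average of the SNRs from sentence 6 up to the final stored SNR
    # (sentence 21 for a 20-sentence list).
    tail = snrs[5:]
    return sum(tail) / len(tail), snrs
```

With a deterministic simulated listener who is correct whenever the SNR is at or above 0 dB, the track oscillates between +1 and -1 dB from the sixth sentence onward and the estimate equals 0 dB.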
Fixed-SNR Sentence-Recognition Scores
A sentence-recognition score was determined for each of the six conditions (2 Noise Types × 3 Processing Strategies). The SNR was fixed for each noise type and equal to the corresponding SRT, determined in the first part of the experiment. The order of the conditions was randomized for each listener. However, each test list was immediately preceded by a training list in the corresponding condition, in order to familiarize the listeners with the given combination of noise and processing type over a broad range of SNRs. The six HINT test lists remaining after SRT determination were selected at random (without replacement). The training lists were used with replacement, such that some of the training lists were experienced by the listeners multiple times in different conditions throughout the entire experiment.
Paired-Comparison Preference Test
For each of the two noise types, comparisons between all three processing types were made (six comparisons in total). Each listener completed 3 trials, for a total of 18 comparisons (except for 1 participant who completed only 2 trials or 12 comparisons).
Before each trial, three sentences from the HINT corpus were selected at random and concatenated to create a running speech sample. The sample was mixed with the background noise at the same SNR as used in the preceding sentence-recognition test. In each presentation, the speech-in-noise sample was processed with each of the processing strategies and presented to the listeners as
Results
Speech Reception Thresholds
The individual SRTs are shown in Figure 2 for each noise type as a function of the hearing profile (

Individual SRTs of All Listeners for ICRA-1 and Factory Noise as a Function of the Hearing Profile (
Sentence Scores
A rationalized arcsine unit (RAU) transform (Studebaker, 1985) was applied to the sentence-recognition scores expressed in percent correct. The RAU-transformed scores were averaged across listeners and are shown in Figure 3 as a function of the processing type (fast, slow, and SNR-aware compression) for the ICRA-1 noise (left panel) and the factory noise (right panel). Subsequently, a three-way, mixed-effects ANOVA was conducted on the transformed data. The fixed factors were

RAU-Transformed Sentence Recognition Scores Averaged Across Listeners as a Function of the Processing Type (Fast, Slow, and SNR-Aware Dynamic Range Compression) for ICRA-1 Noise (Left Panel) and Factory Noise (Right Panel). The error bars indicate the standard errors of the mean. RAU = rationalized arcsine units; SNR = signal-to-noise ratio; ICRA-1 = International Collegium of Rehabilitative Audiology.
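For reference, the RAU transform of Studebaker (1985) can be written as a short Python function. The sketch takes the number of correct items and the number of items as inputs; using the 20 sentences of a list as the item count is an assumption, as the text above only states that the transform was applied to percent-correct scores.

```python
import math


def rau(n_correct, n_items):
    """Rationalized arcsine units for n_correct out of n_items (Studebaker, 1985)."""
    theta = (math.asin(math.sqrt(n_correct / (n_items + 1.0)))
             + math.asin(math.sqrt((n_correct + 1.0) / (n_items + 1.0))))
    # Linear rescaling so that the values resemble percentages in the mid range.
    return (146.0 / math.pi) * theta - 23.0
```

The transform maps a score of 50% to 50 RAU and is symmetric about that point, while scores near 0% and 100% are stretched beyond the 0 to 100 range, which stabilizes the variance near the extremes.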
The ANOVA revealed a large and significant main effect of

RAU-Transformed Sentence-Recognition Scores Averaged Across Listeners and Noise Types as a Function of the Processing Type (Fast-, Slow-, and SNR-Aware Dynamic Range Compression). The error bars represent the 95% confidence intervals (see the main text for details). Level of statistical significance of the difference of means is indicated as follows: * .05 or
Subjective Preference
For each noise type, data from 150 paired-comparison trials were collected (16 Listeners × 9 Trials + 1 Listener × 6 Trials). For each listener, the trials were evaluated for consistency in terms of transitivity, and the trials containing circular triads were rejected (see Kendall, 1962; Kendall & Smith, 1940, for a detailed discussion). Overall, 111 of the 150 trials for the ICRA-1 noise and 120 of the 150 trials for the factory noise were considered for further analysis. For each noise type, the responses from the remaining trials were pooled together to create response matrices. These matrices are summarized in terms of the number of wins for each strategy in the top panels of Figure 5. Subsequently, the values in the response matrices were converted to relative frequency and evaluated for weak stochastic transitivity (Ellermeier et al., 2004). The weak stochastic transitivity was maintained for both noise types, which made it possible to fit a more restrictive Bradley–Terry–Luce (BTL) model (Bradley & Terry, 1952; Ellermeier et al., 2004; Luce, 1959). The BTL model was evaluated separately for each noise type using the M

Results of the Subjective Preference Test as a Function of the Processing Type (Fast-, Slow-, and SNR-Aware Dynamic Range Compression) for ICRA-1 Noise (Left Panels) and Factory Noise (Right Panels). The panels in the top row show the number of wins based on the consistent trials from all listeners. The panels in the bottom row show the corresponding ratio-scale values derived from the BTL model, including the 95% confidence intervals (see the main text for details). Level of statistical significance is indicated as follows: *.05, **.01, ***.001 or
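The two analysis steps described above, the rejection of circular triads and the fit of a BTL model, can be illustrated with a generic Python sketch. It is not the software used in the study: the BTL scale values are obtained here with a simple iterative (minorization-maximization) update, no confidence intervals are computed, and the function names and data layout are hypothetical.

```python
def is_circular(triad):
    """Check one set of three paired comparisons among three items.

    triad: list of three (winner, loser) pairs covering all item pairs.
    The triad is circular (intransitive) if every item wins exactly once.
    """
    wins = {}
    for winner, loser in triad:
        wins[winner] = wins.get(winner, 0) + 1
        wins.setdefault(loser, 0)
    return sorted(wins.values()) == [1, 1, 1]


def fit_btl(win_counts, items, n_iter=200):
    """Estimate BTL scale values, normalized to sum to one.

    win_counts[(i, j)] is the number of times item i was preferred over
    item j. The model assumes P(i preferred over j) = v_i / (v_i + v_j).
    """
    v = {i: 1.0 / len(items) for i in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            wins_i = sum(win_counts.get((i, j), 0) for j in items if j != i)
            denom = sum(
                (win_counts.get((i, j), 0) + win_counts.get((j, i), 0))
                / (v[i] + v[j])
                for j in items if j != i
            )
            updated[i] = wins_i / denom
        total = sum(updated.values())
        v = {i: value / total for i, value in updated.items()}
    return v
```

The pooled response matrix of the consistent trials would be passed as `win_counts`, and the returned values correspond to ratio-scale preference estimates of the kind shown in the bottom panels of the figure.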
Discussion
The purpose of this study was to conduct a perceptual evaluation of the novel SNR-aware compression strategy proposed by May et al. (2018) in HI listeners. Three audiometrically profiled groups were tested:
Compression Strategy
The ANOVA of sentence-recognition scores indicated a statistically significant main effect of processing type and no main effect of noise type. Moreover, the interaction between the noise type and the processing type did not reach statistical significance. However, the following trend was observed in the RAU-transformed sentence-recognition scores shown in Figure 3. In the ICRA-1 noise, it appears that there are almost no differences between the (averaged) scores. While a small advantage of fast- versus slow-acting compression was found in the factory noise condition, a larger advantage over either of the two conventional schemes was obtained with the SNR-aware processing scheme. Because the interaction was not statistically significant, the subsequent post hoc tests had to be conducted on scores pooled across noise types. Nevertheless, it appears that the pattern observed in the analysis might be
Compared with slow-acting compression, fast-acting compression of speech provides ECRs that are closer to the nominal CR prescribed by the gain rationale, resulting in improved audibility. The results of this study suggest that these acoustic effects are necessary (but not sufficient) for an improved speech recognition in noise. If conventional processing is applied, those positive effects are likely offset by a distortion of the noise modulation spectrum and a reduction of the long-term broadband SNR. To take full advantage of fast-acting compression, a differentiation between the target and the background is required, followed by distinct processing of the two signal components (foreground vs. background). This is achieved by the SNR-aware compression strategy and seems to provide a more favorable balance between audibility and ECR improvement versus MTF- and SNR-distortion. Moreover, as mentioned earlier, the advantage of the SNR-aware strategy seems to be more pronounced in the factory noise condition. This could stem from the stronger interaction between the speech and the background noise due to natural envelope fluctuations occurring in the two signals. The SNR-aware compression strategy reduces this interaction, which could be advantageous for speech recognition. However, this interpretation has to be treated with caution due to the weak statistical evidence supporting it.
The subjective preference scores indicated an advantage of the novel SNR-aware compression strategy over both conventional fast- and slow-acting processing for both noise types. In addition, an advantage of slow- over fast-acting compression was observed in the stationary ICRA-1 noise but not in the nonstationary factory noise. This suggests that the cyclical amplification has a more prominent negative effect on the perceived quality in stationary backgrounds. This is consistent with the conclusion drawn by Neuman et al. (1995), that the cyclical
Listener-Specific Factors
As expected, the SRT depended on the degree of hearing loss and was highest (the worst) in the
Limitations
The paired comparisons were conducted using noisy speech at a relatively low SNR, corresponding to the SRT. This made it possible to measure both intelligibility and subjective preference in the same acoustic conditions. However, such conditions are not optimal for evaluating the overall sound quality, because listeners may not be able to focus on a broader range of attributes due to the low intelligibility. The listeners’ preference might, in fact, be driven solely by the differences in intelligibility between the processing types. A potential solution would be to adjust the SNR individually for each processing type, that is, to measure the SRT for all processing types instead of measuring it only for the fast-acting compression, reflecting an
The frequency response of the headphone was equalized to have a flat response with reference to the ear-canal entrance, as described in the
Finally, the results presented in this study evaluated the ideal SNR-aware compression strategy based on the
Applicability to Real-World Scenarios
This study focused on the perceptual benefit of SNR-aware compression when processing noisy speech and did not take the effect of the overall SPL of the speech and noise components into account. The conditions were chosen to emphasize the influence of audibility on the outcome metrics; hence, a relatively low input noise SPL of 50 dB was selected. As a result, in many cases, the speech level was below normal conversational levels. It is possible that the balance between different cues provided by slow- and fast-acting compression would change at higher noise levels, which occur quite frequently in real-world scenarios (Smeds et al., 2015; Weisser & Buchholz, 2019). It would therefore be interesting to investigate a condition with a notably higher background noise SPL (e.g., 65 or 70 dB).
Another factor that is present in many real-world acoustic scenarios, but not considered here, is reverberation. To take advantage of fast-acting compression of the speech signal in even more realistic scenarios where both room reverberation and interfering noise are present simultaneously, it is necessary to update the speech detection stage (e.g., with the power spectral density estimator proposed by Kuklasiński et al., 2016). When dealing with multiple competing sound sources that are spatially separated, the detection of speech-dominated T-F units could alternatively be accomplished by the analysis of spatial cues (May et al., 2011).
Conclusion
A perceptual evaluation of the SNR-aware compression strategy proposed by May et al. (2018) was conducted in controlled laboratory conditions in a group of HI listeners. The strategy was shown to provide a speech intelligibility benefit in noise compared with conventional slow-acting compression and achieved a higher subjective preference compared with both conventional fast- and slow-acting compression schemes. Future research will characterize those listeners who benefit the most from this new compression scheme and will determine the applicability to a broader range of acoustic conditions.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Technical University of Denmark and Centre for Applied Hearing Research.
