
Abstract

Spectro-temporal modulation (STM) sensitivity has been proposed as a sensitive marker of speech intelligibility in challenging listening conditions, yet the underlying auditory mechanisms involved in STM detection remain incompletely understood. The present study measured STM detection thresholds in young normal-hearing and older hearing-impaired listeners and evaluated whether the revised Computational Auditory Signal Processing and Perception model (CASP) can account for individual performance. Thresholds were obtained for six modulation detection conditions, defined by combinations of spectral (0, 1, and 2 c/o) and temporal (4 and 12 Hz) rates. To individualize CASP, outer and inner hair cell loss estimates were obtained from audiometric and Adaptive Categorical Loudness Scaling (ACALOS) data. The results showed systematically elevated thresholds in older hearing-impaired listeners as compared to the young normal-hearing group, particularly at higher spectral rates. The model simulations reproduced overall threshold patterns, but substantially underestimated group differences and interindividual variability in the data. Moreover, the simulations showed limited sensitivity to estimates of outer and inner hair cell loss, supporting the idea that additional supra-threshold mechanisms contribute to STM deficits. While these findings demonstrate the potential of auditory models to predict STM performance, they also highlight the need for refined representations of peripheral and central processing to account for individual STM detection thresholds.

Keywords

auditory modeling, spectro-temporal modulation processing, supra-threshold processing, sensorineural hearing loss

Introduction

Accurately characterizing the sources and consequences of individual hearing loss remains a central challenge in auditory science. A fundamental distinction is typically drawn between reduced audibility and supra-threshold deficits. Audibility limitations—commonly assessed with pure-tone audiometry or speech-in-quiet tests—reflect elevated detection thresholds and are often compensated for by amplification, restoring near-normal performance in quiet listening conditions in many cases (Plomp, 1978). Nevertheless, many listeners continue to experience difficulties when sounds are presented well above threshold, in the so-called supra-threshold conditions (Dreschler & Plomp, 1985; Glasberg & Moore, 1989; Lopez-Poveda, 2014; Plomp, 1978). These difficulties are especially pronounced in noisy or complex environments and cannot be explained by audibility alone, pointing to additional deficits in auditory processing. Despite decades of research, the precise mechanisms underlying supra-threshold deficits remain only partly understood.

To investigate these mechanisms, numerous studies have examined supra-threshold deficits in specific auditory domains. Temporal processing has been assessed using tasks such as gap detection, temporal modulation transfer functions (TMTFs), and temporal fine structure sensitivity (Buss et al., 2004; Lorenzi et al., 2006; Oxenham & Moore, 1997; Qin & Oxenham, 2003; Regev et al., 2023). Spectral processing has been probed with measures of frequency selectivity and spectral masking (Dreschler & Plomp, 1985; Festen & Plomp, 1983; Glasberg & Moore, 1990; van Schijndel et al., 2001), while binaural processing has been evaluated through sensitivity to interaural time and level differences, as well as binaural masking level differences (Gabriel et al., 1992; Hall et al., 1984). A central motivation for these studies has been to determine whether such measures can predict individual speech-in-noise performance (Dreschler & Plomp, 1985; Houtgast & Festen, 2008; Johannesen et al., 2016; Strelcyk & Dau, 2009; Thorup et al., 2016). While informative, these measures typically explain only a modest portion of the variance in speech reception thresholds (SRTs) among hearing-impaired (HI) listeners.

Spectro-temporal modulation (STM) processing has received particular attention as a key ability supporting speech understanding (Singh & Theunissen, 2003; Venezia et al., 2019). STM detection thresholds measure sensitivity to combined temporal and spectral fluctuations, and, unlike other psychoacoustic tasks, have shown robust predictive power for speech intelligibility beyond the audiogram (Bernstein et al., 2016, 2013; Zaar et al., 2024). This has motivated the development of STM-based diagnostics, most prominently the clinically viable Audible Contrast Threshold (ACT) test (Zaar et al., 2023, 2024). Early work by Chi et al. (1999) characterized STM detection in normal-hearing (NH) listeners, deriving spectro-temporal modulation transfer functions (MTFs) across a range of temporal (Hz) and spectral (cycles/octave, c/o) modulation rates. The MTFs showed low-pass characteristics in both dimensions, consistent with earlier findings on temporal resolution (Viemeister, 1979) and spectral resolution (Eddins & Bero, 2007; Green, 1986). Building on this, Bernstein et al. (2013) measured STM sensitivity in NH and HI listeners using a four-octave pink noise carrier modulated at combinations of temporal (4, 12, and 32 Hz) and spectral (0.5, 1, 2, and 4 c/o) rates. HI listeners showed reduced sensitivity at low temporal and high spectral rates, with the largest group difference at $4$ Hz and $2$ c/o. Importantly, thresholds in this condition predicted speech intelligibility in stationary speech-shaped noise even after controlling for audibility (Bernstein et al., 2013). Follow-up studies using STM variants confirmed this link (Mehraei et al., 2014), leading to the development of the ACT test, which evaluates STM detection for a 4 Hz, 2 c/o condition imposed on a 2.5-octave-wide pink-noise carrier low-pass filtered at 2 kHz.

Despite its diagnostic potential, the auditory mechanisms underlying STM detection are not yet fully understood. Evidence suggests that STM performance reflects multiple processes, with temporal fine structure (TFS) encoding contributing more strongly at lower frequencies and frequency selectivity playing a greater role at higher frequencies (Bernstein et al., 2013; Mehraei et al., 2014). Specifically, Bernstein et al. (2013) showed that STM performance using a broadband carrier was jointly predicted by sensitivity to low-rate frequency modulation (FM) at 500 Hz, which is linked to TFS processing, and by high-frequency (4 kHz) frequency selectivity measured with notched-noise masking. This finding provided initial evidence for the involvement of multiple contributing mechanisms. Building on this result, Mehraei et al. (2014) employed narrowband (one-octave) carriers centered at 0.5, 1, 2, and 4 kHz to probe frequency regions in which different mechanisms are expected to dominate. This approach enabled a more direct association of low-frequency STM performance with TFS-based temporal coding and high-frequency STM performance with frequency selectivity, thereby providing further support for a dual-mechanism account of STM detection. This interpretation is based on statistical associations between behavioral measures; however, the underlying physiological mechanisms linking these measures to STM performance remain to be established.

Rather than relying solely on statistical associations between STM thresholds and individual psychoacoustic measures, the present study adopts a computational modeling approach using the Computational Auditory Signal Processing and Perception model (CASP, Jepsen & Dau, 2011; Jepsen et al., 2008; Paulick et al., 2025). Auditory models such as CASP provide a framework for identifying which auditory processing stages may limit performance in a given task and how hearing loss may alter these stages. A model that can predict individual STM thresholds has the potential to both help clarify the underlying mechanisms and to support the development of personalized diagnostic and compensatory strategies.

Earlier modeling work introduced the STM index (STMI), which combines a peripheral transformation with an explicit spectro-temporal analysis using a bank of modulation-selective filters tuned to different scales and rates. This model was consistent with TMTFs measured in NH listeners and successfully predicted speech intelligibility under various distortions (Chi et al., 1999; Elhilali et al., 2003). In contrast, CASP implements separate spectral and temporal modulation filterbanks, simulating nonlinear frequency selectivity and temporal modulation frequency selectivity, respectively. The present study evaluates whether CASP can account for STM detection thresholds without explicit spectro-temporal tuning. By doing so, we aim to determine the extent to which STM sensitivity can be explained by core auditory processing stages, as implemented in CASP, and to identify which aspects of hearing loss most strongly influence this ability.

The CASP model simulates auditory processing through a cascade of peripheral and central stages, including nonlinear frequency selectivity, adaptation mechanisms, and temporal modulation frequency selectivity, followed by a decision stage based on an optimal detector designed for n-alternative forced choice (AFC) paradigms (Green & Swets, 1988). CASP has successfully predicted performance in a wide range of psychoacoustic tasks with NH listeners (Jepsen et al., 2008; Paulick et al., 2025) and has been extended to capture average effects of hearing loss (Jepsen & Dau, 2011). In addition, a speech-based variant of the model (sCASP, Relaño-Iborra et al., 2019) has been applied to predict speech intelligibility in noise.

In the present study, we measured STM detection thresholds in two listener groups: young NH (yNH) listeners and older HI (oHI) listeners. We tested a subset of the STM conditions used by Bernstein et al. (2013), selecting temporal modulation rates of 4 and 12 Hz and spectral modulation rates of 1 and 2 c/o. All STM stimuli were imposed on a low-frequency pink-noise carrier, following the carrier configuration of the ACT paradigm (Zaar et al., 2024). The ACT paradigm itself corresponds specifically to the 4 Hz, 2 c/o modulation condition. For comparison, we also measured pure temporal (AM) detection thresholds at the same temporal rates using the same carrier.

To further characterize auditory processing in the oHI group, participants completed the Adaptive Categorical Loudness Scaling test (ACALOS, Brand & Hohmann, 2002), which provided individual estimates of nonlinear loudness growth. These parameters were incorporated into CASP to individualize the model’s front-end stages. Following established modeling frameworks (Chalupper & Fastl, 2002; Jürgens et al., 2011; Moore & Glasberg, 1997; Stefan et al., 2019), the overall audiometric hearing loss was decomposed into contributions from outer hair cell (OHC) and inner hair cell (IHC) dysfunction. This individualized modeling enabled us to simulate STM performance for each listener and to evaluate whether CASP can account for the observed across-listener variability.

Experimental Methods

Listeners

Ten yNH listeners (4 female, age range: 22–30 years, mean age = 23.9 years) and 10 oHI listeners (2 female, age range: 54–80 years, mean age = 74.2 years) participated in the study. All listening tests were conducted monaurally. The oHI listeners had sensorineural hearing loss in both ears with no history of conductive impairment, and the tested ear was randomly selected. The yNH listeners had audiometric thresholds $\leq 20$ dB HL at octave frequencies from $250$ Hz to $8$ kHz in the tested ear. Individual and mean audiograms of the tested ears are shown in Figure 1. All measurements were conducted without hearing-aid amplification. Participants provided written informed consent and were compensated for their time. The study protocol was approved by the Science Ethics Committee of the Capital Region of Denmark (Reference No. H-16036391).

Figure 1.

Audiometric thresholds for the tested ears of the young normal-hearing (yNH, left panel) and older hearing-impaired (oHI, right panel) listeners. Individual thresholds are shown in gray, with the mean and standard deviation across listeners shown in black.

Procedure and Apparatus

All listening tests were conducted in single-walled, soundproof booths. Stimuli were generated in MATLAB at a sampling rate of $48$ kHz, converted to analog via an RME Fireface soundcard (Haimhausen, Germany), and presented monaurally through Sennheiser HDA200 headphones (Wedemark, Germany). Output levels were calibrated using a GRAS RA0039 ear simulator (Holte, Denmark). NH listeners completed one session in which they underwent pure-tone audiometry, otoscopy, and the AM and STM detection tasks. HI listeners completed two sessions in which they underwent the same tests, with the addition of the Adaptive Categorical Loudness Scaling (ACALOS) test. The test order was randomized across listeners, except for audiometry and otoscopy, which were always conducted at the beginning of the first session.

(Spectro)-Temporal Modulation Detection Task

An STM stimulus consists of a carrier signal modulated by a combination of spectral and temporal modulations, producing upward- or downward-moving ripples in the spectrum. The stimulus followed the design described by Zaar et al. (2023, 2024). The carrier was generated in the frequency domain by summing 2,499 random-phase tones, equally spaced along a logarithmic frequency axis spanning $2.5$ octaves ($354$–$2000$ Hz). This logarithmic spacing yields an approximately pink (1/f) spectrum, with equal power per octave. STM thresholds were measured for four modulator conditions, defined by two spectral ripple rates ($1$ and $2$ c/o) combined with two temporal rates ($-4$ and $-12$ Hz). Negative temporal rates correspond to upward-moving ripples. The modulator signal applied to the carrier was defined as follows:

$$M(x, t) = m \sin[2π(f_{temp} t + f_{spec} x) + ϕ], \tag{1}$$

where $m$ is the modulation depth, $f_{temp}$ is the temporal rate, $f_{spec}$ is the spectral rate, $x$ is the spectral position in octaves, $t$ is time, and $ϕ$ is the initial phase. A more detailed mathematical treatment can be found in Chi et al. (1999) and Zaar et al. (2023). For comparison, pure temporal modulation thresholds (AM) were measured for the same two temporal rates. These stimuli were generated identically to the STM stimuli, except that the spectral modulation rate was set to zero.
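The stimulus construction described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' MATLAB implementation. It assumes that the modulator of Equation 1 scales the amplitude of each tone as $1 + M(x, t)$, with $x$ measured in octaves above the lower carrier edge. The function name `stm_stimulus` and all of its parameters are hypothetical, and level calibration and onset/offset ramps are omitted.

```python
import math
import random


def stm_stimulus(m, f_temp, f_spec, phi=0.0, n_tones=2499, f_low=354.0,
                 n_octaves=2.5, fs=48000, dur=1.0, seed=0):
    """Sum of log-spaced random-phase tones carrying the ripple of Equation 1.

    Assumption: each tone's amplitude is multiplied by 1 + M(x, t), where x
    is the tone's position in octaves above f_low.
    """
    rng = random.Random(seed)
    n_samples = int(round(fs * dur))
    signal = [0.0] * n_samples
    for k in range(n_tones):
        x = n_octaves * k / (n_tones - 1)      # position in octaves re f_low
        freq = f_low * 2.0 ** x                # logarithmically spaced frequency
        carrier_phase = rng.uniform(-math.pi, math.pi)
        for n in range(n_samples):
            t = n / fs
            ripple = m * math.sin(2 * math.pi * (f_temp * t + f_spec * x) + phi)
            signal[n] += (1.0 + ripple) * math.sin(2 * math.pi * freq * t + carrier_phase)
    return signal
```

Setting `f_spec=0` gives the pure AM case, and `m=0` gives the unmodulated carrier used in the reference intervals. The pure-Python loops are slow at the full size used in the study (2,499 tones at 48 kHz), so a reduced call such as `stm_stimulus(0.5, -4.0, 2.0, n_tones=50, fs=8000, dur=0.1)` is more practical for experimentation; a vectorized implementation would be preferable in practice.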

STM thresholds were measured using a three-interval, three-alternative forced-choice (3I-3AFC) procedure with a 1-up-2-down tracking rule, targeting the $70.7\%$ point on the psychometric function (Levitt, 1971). The procedure was implemented in MATLAB using the AFC-Toolbox 1.40 (Ewert, 2013). Each trial consisted of three intervals: two contained the unmodulated carrier and one contained the modulated carrier. The starting phase $ϕ$ was randomly drawn from a uniform distribution in $[-π, π]$ at the beginning of each run and held constant (“frozen”) across all trials within that run. This was a modeling-driven choice necessitated by the use of fixed templates: in the current implementation, reliable detection requires that the modulation phase of the stimulus matches that of the template, and randomizing the starting phase on every trial would, therefore, prevent detection. Each signal had a duration of $1$ s, including $50$-ms raised-cosine onset and offset ramps, with $500$ ms of silence between intervals. All stimuli were presented at a fixed level of $85$ dB SPL. The adaptive variable was modulation depth, defined as $M = 20 \log_{10}(m)$ in dB, starting at $M = -4$ dB. The initial step size was $4$ dB, reduced to $2$ dB after the second reversal, and further reduced to a minimum of $1$ dB. The threshold for each run was calculated as the mean modulation depth across the last six reversals at the minimum step size.
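The adaptive track can be illustrated with a short Python sketch. It is not the AFC-Toolbox implementation. A 1-up-2-down rule converges where two consecutive correct responses are as likely as not, that is, where $p^2 = 0.5$ and hence $p \approx 0.707$. The function `run_staircase` and its parameters are hypothetical. The sketch assumes that the step size shrinks after every second reversal (the text specifies this only for the first reduction) and that the depth is capped at $M = 0$ dB.

```python
def run_staircase(respond, start_db=-4.0, steps_db=(4.0, 2.0, 1.0),
                  reversals_per_step=2, n_final_reversals=6, max_trials=1000):
    """1-up-2-down track on modulation depth M (dB); returns the threshold.

    `respond(depth_db)` returns True for a correct response.
    Assumption: the step size shrinks after every `reversals_per_step`
    reversals until the minimum step size is reached.
    """
    depth = start_db
    step_idx = 0
    correct_in_row = 0
    last_direction = 0          # -1 = depth decreasing (harder), +1 = increasing
    reversals_at_step = 0
    final_reversals = []
    for _ in range(max_trials):
        if respond(depth):
            correct_in_row += 1
            if correct_in_row < 2:
                continue        # two correct responses in a row are needed
            correct_in_row = 0
            direction = -1
        else:
            correct_in_row = 0
            direction = +1
        if last_direction and direction != last_direction:
            if step_idx == len(steps_db) - 1:
                # reversal at the minimum step size: counts toward the threshold
                final_reversals.append(depth)
                if len(final_reversals) == n_final_reversals:
                    return sum(final_reversals) / n_final_reversals
            else:
                reversals_at_step += 1
                if reversals_at_step == reversals_per_step:
                    step_idx += 1
                    reversals_at_step = 0
        last_direction = direction
        # Assumption: depth cannot exceed full modulation (M = 0 dB).
        depth = min(0.0, depth + direction * steps_db[step_idx])
    return None                 # no threshold reached within max_trials
```

With an idealized deterministic listener, `run_staircase(lambda d: d >= -20.0)` settles into an alternation between $-20$ and $-21$ dB and returns $-20.5$ dB. A return value of `None` corresponds to the case in which no threshold is reached and a run would be repeated.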

Thresholds were measured in three blocks (order randomized) defined by spectral modulation rate: pure AM (0 c/o), $1$ c/o, and $2$ c/o. Within each block, temporal rates of $4$ and $12$ Hz were tested in an interleaved manner, yielding six conditions in total. Fully modulated examples of these stimuli (M = 0 dB) are shown in Figure 2. Each condition was measured three times within the block, and the final threshold was computed as the mean across repetitions. Additional runs were performed if (i) a threshold was not reached, (ii) the standard deviation within a run exceeded $5$ dB, or (iii) the standard error across repetitions exceeded $2$ dB.

Figure 2.

Auditory spectrograms of the stimuli used in the AM and STM detection tasks. Spectral ripple rates (0, 1, and 2 c/o; horizontal dimension) were combined with temporal modulation rates (4 and 12 Hz; vertical dimension). All modulations were applied to a 2.5-octave-wide noise carrier with an upper cutoff frequency of 2 kHz. Spectrograms are shown for fully modulated stimuli (M = 0 dB). During testing, modulation depth was adaptively varied to estimate the detection threshold for each AM and STM condition. AM = amplitude modulation; STM = spectro-temporal modulation.

Participants received visual feedback after each trial indicating whether their response was correct. For STM conditions, one training run was completed per STM combination. For AM conditions, one training run was completed for the $4$ -Hz condition. If no threshold was reached in the initial training run, a second training run was provided.

Adaptive Categorical Loudness Scaling (ACALOS)

The ACALOS test (ISO 16832, 2006) was administered following the procedure described by Brand and Hohmann (2002). Stimuli were one-third-octave band noises with center frequencies of $0.25$ , $0.5$ , $1$ , $2$ , $4$ , and $6$ kHz. Each stimulus had a duration of $1$ s with $50$ ms raised-cosine onset and offset ramps. Participants rated the loudness of each stimulus on an 11-point categorical scale ranging from “inaudible” to “extremely loud.” The ACALOS procedure is limited to a maximum presentation level of 105 dB HL. Within each run, stimuli of different center frequencies were presented in randomized, interleaved order. Each participant first completed a training run, followed by a second run from which the final results were obtained. For details on the ACALOS procedure, the reader is referred to Brand and Hohmann (2002).

Statistical Analysis

Statistical analyses were performed on the measured (spectro)-temporal modulation detection thresholds using a repeated-measures analysis of variance based on fits of a linear mixed-effects model. The model included group (yNH vs. oHI), temporal modulation rate, and spectral modulation rate as fixed effects, and listener as a random effect. All main effects and their interactions were tested. Model assumptions were checked using Levene’s test for homogeneity of variance and the Shapiro-Wilk test for normality of residuals. Significant interactions were followed up with Holm-Bonferroni-corrected post-hoc pairwise comparisons. The significance level was set to 0.05 for all analyses.
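As an illustration of the correction step, the following Python sketch computes Holm-Bonferroni adjusted $p$-values using the standard step-down procedure. It does not reproduce the statistical software used in the study; the function name `holm_adjust` is hypothetical, and the example values in the usage note are arbitrary.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease along the sorted sequence.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

For example, `holm_adjust([0.01, 0.04, 0.03])` gives approximately `[0.03, 0.06, 0.06]`: the smallest value is multiplied by 3, the next by 2, and the largest inherits 0.06 because adjusted values may not decrease along the sorted sequence.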

Auditory Processing Model

The preprocessing stages of the CASP model are illustrated in Figure 3 and follow the framework described by Jepsen et al. (2008) and Paulick et al. (2025). Briefly, the input signal passes through outer- and middle-ear filters, followed by a dual-resonance nonlinear filterbank (DRNL, Lopez-Poveda & Meddis, 2001) that simulates level-dependent frequency selectivity. The filterbank output is then processed by a nonlinear IHC stage, introduced by Paulick et al. (2025), which captures saturation at high sound pressure levels. This is followed by an adaptation stage consisting of five cascaded feedback loops (Püschel, 1988). Finally, the signal is decomposed into modulation subbands using a temporal modulation filterbank, yielding a three-dimensional internal representation with dimensions of time, auditory frequency, and modulation frequency. The final preprocessing stage limits access to modulation-phase information at higher temporal rates (above $10$ Hz; Dau et al., 1997; Jepsen et al., 2008); details of this stage are provided in the Appendix.

Figure 3.

Preprocessing stages of the CASP model that construct the three-dimensional internal representation. Sensorineural hearing loss is simulated via (i) outer hair cell (OHC) loss, implemented as a modification of the broken-stick nonlinearity in the dual-resonance nonlinear (DRNL) filterbank, and (ii) inner hair cell (IHC) loss, implemented as an attenuation at the output of the IHC stage.

The model backend consists of an optimal detector designed for $n$-AFC paradigms. The model follows the same adaptive procedure as the listeners and makes decisions on each trial based on correlations between a template and the interval representations. Unlike Paulick et al. (2025), who used a supra-threshold modulation depth of $M = -6$ dB for template generation, we used $M = 0$ dB for the present STM task. While this choice has negligible effects for temporal modulation detection, the additional spectral dimension of STM stimuli makes model performance more dependent on the modulation depth used to construct the template. At lower supra-threshold depths, internal representations of STM stimuli are noisier, which degrades the resulting template and reduces detection performance. Using a fully modulated ($M = 0$ dB) template therefore provides a more robust template for STM detection. Templates were averaged across 15 presentations, as in Paulick et al. (2025). Template generation was performed within-subject, under the assumption that each simulated HI listener forms an individual template. The model also incorporates internal noise that limits resolution. Following Dau et al. (1996), internal noise is not added to the internal representations but is implemented as a fixed resolution limit at the decision stage. Detection occurs when the correlation exceeds a criterion defined by a constant internal noise variance. This variance is estimated under a fitting condition using an intensity-discrimination task that matches just-noticeable differences at 60 dB SPL for a 1-kHz tone and broadband noise (Paulick et al., 2025). This calibration was performed in the NH configuration and subsequently kept constant across tasks and simulated HI listeners.

Among the preprocessing stages, the DRNL filterbank plays a central role in simulating sensorineural hearing loss (SNHL). The DRNL filterbank implements level-dependent frequency selectivity using two parallel pathways: a linear path controlled by a gain parameter, and a nonlinear path incorporating a broken-stick nonlinearity. Together, these pathways shape the DRNL input–output (I/O) function:

$$y = g \cdot x + \mathrm{sign}(x) \cdot \min(a \cdot |x|, b \cdot |x|^{c}), \tag{2}$$

where $g$ sets the slope of the linear region (dominating at higher levels), $c$ is the compression exponent, and $a$ and $b$ determine the knee point where compression begins.
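Equation 2 can be written out directly as a static input-output function. The Python sketch below is illustrative only: it ignores the bandpass and lowpass filtering that surrounds the nonlinearity in the actual DRNL filterbank, the function names are hypothetical, and the parameter values in the usage note are arbitrary placeholders rather than CASP parameters.

```python
import math


def drnl_io(x, g, a, b, c):
    """Equation 2: linear path plus broken-stick nonlinear path."""
    nonlinear = min(a * abs(x), b * abs(x) ** c)
    return g * x + math.copysign(nonlinear, x)


def knee_point(a, b, c):
    """Input magnitude at which a*|x| equals b*|x|**c (requires c < 1)."""
    return (b / a) ** (1.0 / (1.0 - c))
```

Below the knee point the nonlinear path behaves linearly with gain $a$, and above it the path follows the compressive branch $b \cdot |x|^{c}$. With placeholder values $a = 100$, $b = 1$, and $c = 0.25$, `knee_point(100.0, 1.0, 0.25)` is about $2.2 \times 10^{-3}$; lowering $a$ moves the knee to higher input levels, whereas lowering $b$ moves it to lower levels.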

To simulate SNHL within this framework, it is assumed that the total audiometric hearing loss can be separated into contributions from OHC and IHC dysfunction (Jepsen & Dau, 2011; Lopez-Poveda & Johannesen, 2012; Moore & Glasberg, 1997):

$${HL}_{tot} = {HL}_{OHC} + {HL}_{IHC}, \tag{3}$$

with both terms expressed in dB. This formulation assumes that audiometric thresholds reflect SNHL alone, excluding other sources of impairment. The OHC-related component, ${HL}_{OHC}$, reflects a loss of cochlear gain, typically observed in basilar-membrane input–output (BMIO) functions or loudness growth functions. Earlier CASP work (Jepsen & Dau, 2011) estimated this parameter from BMIOs derived from temporal masking curves (TMCs; Lopez-Poveda et al., 2003; Plack & Oxenham, 1998). In the present study, ${HL}_{OHC}$ was instead estimated from ACALOS data using the procedure described by Jürgens et al. (2011). Briefly, individual loudness functions were derived from ACALOS and fitted with a dynamic loudness model (Chalupper & Fastl, 2002) that incorporates the two-component framework of hearing loss, separating OHC- and IHC-related contributions. The model simulates loudness growth for different proportions of OHC impairment, and a fitting algorithm determines the proportion that best matches the measured data. This approach yields time-efficient estimates consistent with those from TMCs (Jürgens et al., 2011).

Based on the estimated ${HL}_{OHC}$ , the DRNL I/O function (Equation 2) was then modified. Lowering $a$ reduces low-level gain, raising the compression knee point, while lowering $b$ extends compression to lower levels. Because many parameter combinations can yield similar I/O functions, OHC-driven modifications were restricted to the $a$ parameter, keeping $b$ , $c$ , and $g$ fixed. This choice confines the effects of OHC loss to a reduction of gain in the nonlinear pathway at low to mid input levels. The resulting modification follows established implementations of OHC loss (Lopez-Poveda & Johannesen, 2012; Plack et al., 2004):

$$a_{HL} = a_{NH} \cdot 10^{-{HL}_{OHC} / 20}, \tag{4}$$

where $a_{NH}$ is the normal-hearing value and $a_{HL}$ the adjusted value for OHC loss. The residual hearing loss, after accounting for ${HL}_{OHC}$, was attributed to IHC dysfunction, ${HL}_{IHC}$, and implemented as an attenuation applied to the IHC stage outputs.
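The bookkeeping implied by Equations 3 and 4 is summarized in the Python sketch below. The function names are hypothetical, and the conversion of ${HL}_{IHC}$ into a linear amplitude factor of $10^{-{HL}_{IHC}/20}$ is an assumption made for illustration; the text states only that the IHC loss is applied as an attenuation at the output of the IHC stage.

```python
def ihc_loss_db(hl_tot_db, hl_ohc_db):
    """Equation 3 rearranged: residual loss attributed to IHC dysfunction."""
    return hl_tot_db - hl_ohc_db


def adjusted_a(a_nh, hl_ohc_db):
    """Equation 4: scale the DRNL parameter a by the OHC loss in dB."""
    return a_nh * 10.0 ** (-hl_ohc_db / 20.0)


def ihc_attenuation(hl_ihc_db):
    """Assumption: IHC loss applied as a linear amplitude factor."""
    return 10.0 ** (-hl_ihc_db / 20.0)
```

As a purely hypothetical example, a total loss of 50 dB with an estimated OHC component of 30 dB leaves 20 dB for the IHC component; the parameter $a$ is then scaled by $10^{-30/20} \approx 0.032$, and the IHC stage output is attenuated by a factor of $10^{-20/20} = 0.1$ under the stated assumption.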

Select adjustments were made to the model implementation compared to that described by Paulick et al. (2025), specifically in the DRNL filterbank parametrization, the implementation of modulation-phase sensitivity, and the backend metric. Details regarding these adjustments, as well as a backwards-compatibility assessment using the same benchmark experiments as in Paulick et al. (2025), are given in the Appendix.

Results

Experimental Results

Modulation detection thresholds for the yNH listeners (blue) and the oHI listeners (red) are shown in Figure 4 as a function of spectral rate (0, 1, and 2 c/o) at two temporal rates: 4 Hz (left panel) and 12 Hz (right panel). The statistical analysis revealed significant main effects of group ( $F_{1, 18} = 23.65, p < .001$ ), temporal rate ( $F_{1, 90} = 27.36, p < .001$ ), and spectral rate ( $F_{2, 90} = 163, p < .001$ ). Significant two-way interactions were also observed between temporal and spectral rates ( $F_{2, 90} = 12.18, p < .001$ ) and between spectral rate and group ( $F_{2, 90} = 41.9, p < .001$ ). Neither the interaction of temporal rate and group nor the three-way interaction reached significance. Post-hoc analyses confirmed that oHI listeners had significantly higher (worse) thresholds than yNH listeners overall ( $p < .001$ ). Group differences, expressed as the modulation depth difference $Δ M$ , were especially pronounced at the highest spectral rate, that is, at 2 c/o at 4 Hz ( $Δ M = 7.43$ dB, SE = $0.99$ dB) and 2 c/o at 12 Hz ( $Δ M = 5.62$ dB, SE = $0.99$ dB). These results indicate that both temporal and spectral rates significantly affect STM detection, and that oHI listeners are disproportionately affected by increases in spectral rates.

Figure 4.

Boxplots of (spectro)-temporal modulation detection thresholds for yNH (blue) and oHI (red) listeners across temporal and spectral rate combinations. Left: Thresholds at 4 Hz as a function of spectral rate (0, 1, and 2 c/o). Right: Thresholds at 12 Hz. Individual data are shown as crosses. In boxplots, boxes represent the interquartile range (IQR; 25th–75th percentiles), central lines indicate medians, and whiskers extend to 1.5 $\times$ IQR. Stars indicate the significance level of group differences (* $p < .05$ , ** $p < .01$ , *** $p < .001$ ). yNH = young normal-hearing; oHI = older hearing-impaired.

Model Predictions

The top panels of Figure 5 show AM and STM thresholds for yNH listeners (dark-blue circles) alongside CASP predictions (light-blue circles). The model captured the general trends observed in the data, namely lower STM thresholds at 4 Hz compared to 12 Hz and a monotonic increase in thresholds with spectral rate at 12 Hz. However, at 4 Hz, the model overestimated AM sensitivity and predicted a performance drop from AM to STM that was not observed in the data, thereby exaggerating the AM-STM difference. Quantitatively, predictions for NH listeners yielded a mean absolute error (MAE) of 2.18 dB and a strong correlation with empirical means across conditions (Pearson’s $r = 0.89$ ). The largest deviation (5.13 dB) occurred in the 4 Hz AM condition.
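The two summary metrics reported here, the mean absolute error and Pearson's correlation between measured and predicted condition means, can be computed as in the following Python sketch. The function names are hypothetical, and the numbers in the usage note are arbitrary placeholders, not data from the study.

```python
import math


def mean_absolute_error(measured, predicted):
    """Average absolute difference (in dB) between paired thresholds."""
    return sum(abs(p - q) for p, q in zip(measured, predicted)) / len(measured)


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

With placeholder thresholds `[-20, -18, -15]` (measured) and `[-21, -16, -15]` (predicted), `mean_absolute_error` returns 1.0 dB. The two metrics are complementary: the correlation is insensitive to a constant offset or scaling between predictions and data, whereas the MAE captures such offsets directly.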

Figure 5.

Top: AM and STM thresholds for yNH listeners (dark-blue symbols with error bars) and CASP predictions (light-blue symbols with error bars). Error bars represent the SD across repeated simulations. The model achieved a mean absolute error of 2.18 dB and correlated strongly with listener data ( $r = .89$ ). Bottom: AM and STM thresholds for oHI listeners (dark-red, individual data shown as crosses) and CASP predictions (light-red, individual data shown as crosses). Across conditions, HI predictions yielded an MAE of $1.81$ dB and a strong correlation with group averages ( $r = .98$ ). For all boxplots, boxes represent IQRs, central lines medians, and whiskers extend to 1.5 $\times$ IQR. AM = amplitude modulation; STM = spectro-temporal modulation; CASP = Computational Auditory Signal Processing and Perception model; yNH = young normal-hearing; SD = standard deviation; oHI = older hearing-impaired; HI = hearing-impaired; MAE = mean absolute error; IQR = interquartile range.

The bottom panels of Figure 5 show measured and predicted thresholds for oHI listeners. Predictions relative to the group average yielded an MAE of $1.81$ dB and a strong correlation with empirical means ( $r = .98$ ) across the six conditions. The largest deviation ( $3.82$ dB) occurred in the 4 Hz, 2 c/o condition. Despite this close match on average, the model underestimated both the variability in the listener data and the magnitude of the group effect. This discrepancy is illustrated in Figure 6, which shows mean group differences in threshold ( $Δ M = M_{HI} - M_{NH}$ ) across conditions. Behaviorally, HI listeners performed similarly to NH listeners in AM detection but showed elevated thresholds in STM conditions, with group differences of up to $\sim 7.43$ dB. In contrast, CASP predicted only modest threshold elevations (up to $\sim 2.33$ dB), primarily in the 4 Hz conditions.

Figure 6.

Group differences in modulation depth at threshold ( $Δ M = M_{HI} - M_{NH}$ ). Results are shown for human listeners (filled symbols) and CASP predictions (empty symbols) across AM and STM conditions at $4$ Hz (triangle) and $12$ Hz (circle). While behavioral data revealed minimal group differences for AM detection but substantial threshold elevations for HI listeners in STM conditions (up to $\sim 7.43$ dB), the model predicted only modest differences between groups (up to $\sim 2.33$ dB), with the largest effects observed in the 4 Hz conditions. HI = hearing-impaired; NH = normal-hearing; CASP = Computational Auditory Signal Processing and Perception model; AM = amplitude modulation; STM = spectro-temporal modulation.

Figure 7 shows scatter plots of predicted versus measured thresholds for individual HI listeners across AM and STM conditions at 4 Hz (left) and 12 Hz (right). Group-averaged NH thresholds and model predictions are additionally plotted for reference. At the individual level, the model fails in particular to predict the thresholds of the poorest performers in the STM tasks, that is, those with the highest thresholds. When pooled across conditions, predictions correlated strongly with the behavioral data ( $r = .87$ , $p < .001$ ), indicating that CASP captured across-condition variance. However, within individual conditions, correlations did not remain significant after correction for multiple comparisons (see Table 1), although the limited sample size (N = 10) may have reduced statistical power. The strong pooled correlation therefore appears to be driven by the model’s ability to capture between-condition differences rather than within-condition, across-listener variability. Thus, while CASP accounts well for group-averaged performance, it fails to explain individual differences in detection thresholds.

Figure 7.

Predicted versus measured thresholds for individual HI listeners (red) across AM and STM conditions at 4 Hz (left) and 12 Hz (right). Group-averaged NH thresholds and model predictions (blue) are shown for reference. Condition types (AM, 1 c/o, and 2 c/o) are indicated by markers. The dashed line indicates perfect agreement. Pooled across conditions, predictions correlated strongly with the data ( $r = .87$ , $p < .001$ ). Within-condition correlations were weaker and not significant after correction (see Table 1). HI = hearing-impaired; NH = normal-hearing; AM = amplitude modulation; STM = spectro-temporal modulation.

Table 1.

Pearson’s Correlations ( $r$ ) and Corresponding $p$ -Values, Corrected for Multiple Comparisons, Between Measured and Predicted Thresholds for Individual HI Listeners in the Different Conditions.

Condition	$r$	$p_{corr}$
4 Hz, AM	$- .36$	1.830
4 Hz, 1 c/o	$.20$	3.474
4 Hz, 2 c/o	$.64$	.282
12 Hz, AM	$.34$	2.046
12 Hz, 1 c/o	$.61$	.354
12 Hz, 2 c/o	$.10$	4.740
Overall (all conditions)	$.87$	$< .001$

Abbreviations: HI = hearing-impaired; AM = amplitude modulation.
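The contrast between a strong pooled correlation and weak within-condition correlations can be illustrated with a small synthetic example. The sketch below is not based on the study data: the condition means, listener offsets, and predicted offsets are hypothetical values chosen only to show how between-condition differences can dominate a pooled Pearson correlation when the predictions carry no information about the ranking of listeners within a condition.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical condition means in dB (six conditions); not the measured values.
condition_means = np.array([-20.0, -15.0, -10.0, -18.0, -12.0, -6.0])

# Hypothetical listener offsets around each condition mean (10 listeners).
measured_offsets = np.linspace(-2.0, 2.0, 10)

# Predicted offsets that alternate in sign and do not follow the listener ranking.
predicted_offsets = 0.3 * np.where(np.arange(10) % 2 == 0, 1.0, -1.0)

# Rows are conditions, columns are listeners.
measured = condition_means[:, None] + measured_offsets[None, :]
predicted = condition_means[:, None] + predicted_offsets[None, :]

# Correlation across listeners within each condition.
within_r = [pearson_r(measured[c], predicted[c]) for c in range(len(condition_means))]

# Correlation after pooling all conditions and listeners.
pooled_r = pearson_r(measured.ravel(), predicted.ravel())
```

In this toy case the pooled correlation is high because both data sets share the same condition means, while every within-condition correlation stays close to zero.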

Discussion

This study evaluated the ability of the CASP model to predict STM detection thresholds in yNH and oHI listeners. The model reproduced several aspects of the measured data, including the higher sensitivity (i.e., lower thresholds) to lower temporal rates in the STM condition. However, discrepancies with the human data were also observed, particularly the inability to capture the stable performance of yNH listeners across spectral rates at the lower temporal modulation frequency, the pronounced STM deficits at higher spectral rates, and the variability across oHI listeners. The simulations provide insights into potential mechanisms contributing to STM detection in both NH and HI listeners, while also highlighting specific auditory processes that warrant further investigation.

For yNH listeners, STM detection performance was stable across spectral densities at 4 Hz but declined with increasing spectral density at 12 Hz. This pattern is consistent with earlier reports of robust low-rate modulation sensitivity and increasing difficulty when spectral and temporal modulations are combined at higher rates (Bernstein et al., 2013). Two mechanisms have been proposed to explain this behavior: (i) frequency selectivity, which determines the extent of spectral smearing across channels as density increases (Bernstein et al., 2013; Mehraei et al., 2014), and (ii) TFS coding, which may provide cues for detecting slow frequency modulations at low carrier frequencies (Bernstein et al., 2013; Mehraei et al., 2014; Moore & Sek, 1996; Moore & Skrodzka, 2002). From this perspective, good performance at 4 Hz reflects reliance on TFS-based coding of spectral-peak fluctuations, whereas the rise in thresholds at higher temporal rates occurs once TFS cues are no longer available and listeners must rely on AM cues that are degraded by limited frequency resolution.

The CASP model predictions were broadly consistent with the observed NH data but offered a different mechanistic interpretation. The model captured the pattern observed in the human data whereby STM detection thresholds were lower at 4 Hz than at 12 Hz and increased monotonically with spectral rate at 12 Hz. Importantly, these trends emerged not from an explicit TFS-processing mechanism, but from limitations in modulation-phase sensitivity implemented in the model. At lower temporal modulation rates (below $\sim 10$ Hz), residual phase cues remained accessible at the output of the model, supporting robust performance across spectral densities. At higher temporal rates, these phase cues were lost, and the model operated primarily on modulation power. When spectral and temporal modulations were combined, auditory filtering in the model introduced smearing across adjacent channels, reducing effective modulation depth within each channel and thereby lowering the signal-to-noise ratio at the temporal modulation-filter outputs. In the 12 Hz conditions, the CASP predictions directly reflected this loss of within-channel modulation power, producing higher thresholds (i.e., lower sensitivity) in STM compared to AM. Thus, the model suggests that limits in temporal modulation-phase sensitivity, rather than access to carrier-level TFS, may be sufficient to explain NH performance.

The 10-Hz phase-sensitivity cutoff implemented in the model was motivated by findings from Dau (1996), who tested the ability of three normal-hearing listeners to discriminate $180^{\circ}$ shifts in the starting phase of sinusoidal modulations at different modulation frequencies. Sensitivity was high at very low rates ( $\sim 3$ Hz) but declined steeply with increasing frequency, reaching chance performance by $12$ Hz. However, broader datasets on envelope phase sensitivity remain scarce, and existing evidence suggests substantial interindividual variability (Sheft & Yost, 2007), leaving open questions about the precise cutoff frequency and its generality across listeners. In the present model, predictions were sensitive to the choice of this cutoff, suggesting that variability in modulation-phase sensitivity could plausibly contribute to the spread in STM thresholds, although this cannot be evaluated directly with the present data. Moreover, the physiological mechanisms underlying this limitation are still unidentified. Further systematic investigation is therefore needed to establish firmer constraints on this stage and to clarify its potential role in STM processing and speech perception.

Notably, the model overestimated sensitivity, that is, predicted lower thresholds, in the 4-Hz AM condition, likely because it benefited from phase cues that human listeners either do not access or do not exploit in simple AM detection tasks. Thus, any benefit from phase information seems to arise primarily in conditions where modulation varies across frequency, such as STM. The model, by contrast, treats all phase cues as informative, regardless of their distribution across channels. The failure to reproduce flat performance across spectral rates at 4 Hz, therefore, likely reflects an overestimation of AM sensitivity rather than an underestimation of STM thresholds. Incorporating a stage that applies modulation-phase sensitivity selectively—enhancing detection only when there is meaningful across-frequency coherence—would likely improve predictions and bring them closer to human performance.

For HI listeners, AM detection was broadly comparable to that of NH listeners, consistent with previous studies (Moore & Glasberg, 2001; Regev et al., 2024; Schlittenlacher & Moore, 2016; Wallaert et al., 2017; Wiinberg et al., 2019). In contrast, STM detection was substantially impaired, with threshold elevations of up to 7.43 dB relative to the NH group. CASP simulations, however, predicted only modest elevations (up to 2.33 dB) and showed little variability across individualized predictions. This mismatch suggests that the current model does not fully capture the mechanisms underlying STM deficits in HI listeners, particularly for those performing worst in the STM task. This limitation may reflect missing processing stages in the model, or shortcomings in how hearing loss is individualized. Simulated thresholds were relatively insensitive to estimates of OHC and IHC losses, with only large amounts of IHC loss producing measurable effects. This indicates that the model primarily reflected audibility limitations.

The limited influence of simulated OHC loss likely stems from the high presentation level (85 dB SPL). At such levels, the DRNL filterbank is dominated by its linear pathway, reducing frequency selectivity even in the NH configuration (Lopez-Poveda & Meddis, 2001). The OHC-loss parameter decreases nonlinear gain, effectively lowering the level at which the linear path starts to dominate. This broadens filters at low to mid levels but has little effect at higher levels, where tuning is already governed by the unchanged linear path in both NH and HI simulations. This behavior is consistent with psychophysical findings showing minimal differences in frequency selectivity between NH and HI listeners at high presentation levels (Carney & Nelson, 1983; Dubno & Schaefer, 1991; Florentine, 1978). Low-pass filtering of the noise carrier at 2 kHz further limited the expected influence of OHC loss. The DRNL constrains the extent of implementable OHC loss at low frequencies, consistent with evidence for reduced cochlear gain in apical regions (Plack et al., 2008; Robles & Ruggero, 2001), making extensive OHC-related loss less likely in this range. Consequently, most of the simulated hearing loss under the present stimulus conditions was attributed to IHC dysfunction (Johannesen et al., 2014). This interpretation is also consistent with previous STM studies (e.g., Mehraei et al., 2014), which emphasized the role of frequency selectivity primarily at higher carrier frequencies not tested here. Taken together, these considerations indicate that differences in frequency selectivity between NH and HI listeners were unlikely to have substantially influenced STM detection thresholds in this study.

Instead, IHC loss emerged as the primary determinant of elevated thresholds in the simulations, although the variability introduced by this parameter remained limited. The impaired DRNL sets the operating point along the IHC nonlinearity, but the inputs reaching the IHC stage after cochlear transformation showed restricted variability under the present stimulus conditions. IHC loss is then implemented as an attenuation applied to the IHC output. This approach captures audibility limitations for large losses but appears insufficient to account for more subtle supra-threshold deficits. The precise ways in which IHC loss alters the IHC input–output function remain incompletely understood (Patra et al., 2024). Moreover, the IHC-loss parameter functions as a catch-all for residual threshold elevation not explained by OHC contributions, and therefore likely conflates multiple impairments rather than providing a physiologically specific description of IHC dysfunction.

The role of other potential impairments, such as neural deafferentation, remains unclear. Existing models of transduction (Lopez-Poveda, 2014) could be integrated into the present framework to explore their potential impact on STM encoding. Importantly, modulation-phase sensitivity, which is critical for accounting for the low-rate advantage in STM conditions for NH listeners, may also contribute to deficits observed in HI listeners. In the CASP model, the current OHC and IHC impairments do not alter the representation of modulation phase or the integration of cues across frequency, which may explain why STM thresholds were only minimally affected. Modeling studies of speech perception have shown that explicitly analyzing coherence across auditory channels improves intelligibility predictions under phase-distorted noisy conditions (Chabot-Leclerc et al., 2014; Elhilali et al., 2003; Relaño-Iborra et al., 2019). However, it remains unclear how hearing loss or ageing may affect modulation-phase sensitivity or across-frequency integration. Extending CASP to include impairments in modulation-phase processing could, therefore, provide a mechanistic account of the pronounced STM deficits in HI listeners, but targeted empirical studies are needed to inform such model developments.

Several limitations of the present study should be considered when interpreting the findings. A first limitation concerns the use of a fixed presentation level (85 dB SPL) without amplification for HI listeners, leading to differences in sensation level across listeners. This was a design choice aimed at facilitating interpretation of the modeling results. By keeping the acoustic input identical across listeners, differences in model predictions could be attributed unambiguously to the simulated hearing loss rather than to individualized stimulus levels. In contrast, equating sensation levels would have introduced additional individualized stimulus differences that are difficult to disentangle within the modeling framework. However, this choice may have reduced audibility at certain frequencies for some HI participants, potentially leading to an underestimation of their STM sensitivity. Nevertheless, measurable thresholds were obtained for all listeners in all conditions, and the range of STM thresholds was comparable to that reported in studies using amplification (Zaar et al., 2023). This consistency further supports the view that STM detection captures aspects of supra-threshold auditory processing beyond the audiogram. Elevated presentation levels can also negatively affect modulation sensitivity in NH listeners (Magits et al., 2019). To address these issues, Zaar et al. (2023) applied individualized linear amplification in STM tasks, ensuring adequate audibility while avoiding unnecessarily high levels and providing more ecologically valid listening conditions. Future work should consider adopting this approach and testing whether the model can predict STM thresholds when audibility is systematically restored. Comparisons across unaided listening, individualized linear amplification, and more realistic hearing-aid processing with nonlinear gain would provide a more stringent and ecologically relevant evaluation of the model’s predictive power. 
Furthermore, unlike in previous STM paradigms, the modulation starting phase was fixed within each run and randomized across runs. This modeling choice was required by the use of fixed templates, as the current implementation cannot reliably detect modulation when stimulus and template phases are mismatched. Although this design could, in principle, allow listeners to exploit short-term, frequency-specific level cues, behavioral performance was stable across repetitions and comparable to earlier studies, suggesting that such cues were unlikely to have played a dominant role. This limitation reflects a current constraint of the model that could be addressed in future work.

A second limitation concerns the re-parametrization of certain model stages that was necessary to account for the present dataset, specifically the DRNL filterbank and the modulation phase-sensitivity stage. The revised CASP implementation (Paulick et al., 2025) employed relatively broad auditory filters (cf. Osses Vecchi et al., 2022, for model comparisons). Although this reduced frequency selectivity had little impact on simulated NH–HI group differences at the high presentation levels used here, it did affect the magnitude of the sensitivity reduction (i.e., threshold elevation) from AM to the 2 c/o STM condition, leading to a loss of sensitivity at 2 c/o that exceeded the empirical data. This aligns with earlier observations that limited frequency selectivity particularly hampers detection at higher spectral ripple densities, where closely spaced peaks cannot be resolved by broader filters (Mehraei et al., 2014). To address this, we re-instated the original DRNL implementation (Lopez-Poveda & Meddis, 2001), simplified by omitting the characteristic-frequency shift with level. These changes produced overall sharper auditory filters at high levels. In parallel, the phase-sensitivity stage was revised, as the earlier implementation produced substantial residual phase cues at higher modulation frequencies (see the Appendix for implementation details). The updated stage more effectively suppressed these cues. Although these adjustments were validated for backward compatibility with prior datasets (see the Appendix, Section d.), they underscore a broader concern: parameter modifications that improve descriptive accuracy for a specific dataset may compromise generalizability and complicate the functional interpretation of model parameters. Future work should therefore evaluate the robustness of these re-parametrizations across larger and more varied datasets to ensure that the model continues to reflect plausible auditory mechanisms with interpretable parameters.

Finally, a more fundamental limitation concerns the reliance on audiogram-based individualization when attempting to predict supra-threshold performance such as STM detection. STM thresholds are thought to probe aspects of auditory processing that extend beyond audibility and are, therefore, only partially constrained by the audiogram. In this study, peripheral deficits were estimated in two ways: the proportion of OHC loss was derived from ACALOS loudness-growth data using a loudness model to capture recruitment, while IHC loss was inferred from the residual audiometric loss after accounting for OHC contributions. Although this partitioning provides a principled framework, it remains inherently tied to the audiogram and inherits uncertainties from the loudness-model fitting, which propagate into the individualization process. In the present study, the impact of these uncertainties was likely limited, as STM thresholds were relatively insensitive to the precise OHC/IHC parameter values. More broadly, however, audiogram-based individualization cannot capture other contributors to STM variability—such as synaptopathy, central auditory changes, or cognitive factors—and interpreting residual audiometric loss as IHC dysfunction risks conflating peripheral and nonperipheral sources of variability.

Still, pure-tone thresholds often covary with supra-threshold phenomena such as frequency selectivity and loudness recruitment (Sanchez-Lopez et al., 2020), and several studies have shown that audiogram-based predictors explain a substantial proportion of variance in speech outcomes (Bernstein et al., 2016, 2013; Zaar et al., 2023). Within this framework, audiogram-based models may be best viewed as defining the variance attributable to peripheral deficits, thereby clarifying the extent to which residual variability must arise from other mechanisms. This perspective also highlights the potential of STM thresholds themselves to serve as complementary individualization metrics, extending beyond audiogram-based parameters and enabling more comprehensive accounts of interindividual differences in speech perception.

Conclusion

This study examined STM detection in NH and HI listeners and evaluated the ability of an individualized auditory model to predict these STM detection thresholds. CASP reproduced general threshold patterns across temporal and spectral rates but failed to fully capture group differences and the pronounced variability across individuals. Model predictions showed limited sensitivity to OHC and IHC loss estimates, indicating either that other mechanisms contribute to STM deficits or that current implementations of these impairments are insufficient. Within the model framework, modulation-phase sensitivity emerged as a key factor for explaining NH performance at low and high temporal rates, highlighting its potential importance for understanding STM processing more broadly. Future work should investigate the role of modulation-phase sensitivity in speech perception and examine how hearing loss alters this mechanism, alongside contributions from other mechanisms, such as neural deafferentation, central auditory changes, and age-related and cognitive factors. Moreover, future work could explore incorporating STM thresholds as an individualization metric, beyond traditional audiogram-based approaches. This may enable the development of auditory models that more accurately predict speech performance and provide a stronger basis for hearing-aid evaluation.

Appendix: Model Adjustments

Select adjustments were made to the model implementation compared to that described by Paulick et al. (2025). The following sections provide a detailed account of these changes, followed by an assessment of their backward compatibility. Specifically, the modified model was re-evaluated on the same benchmark experiments used by Paulick et al. (2025)—intensity discrimination, forward masking, and modulation detection—to verify that its predictive performance was maintained.

DRNL Filterbank

For the NH simulations, the original parametrization of the DRNL model proposed by Lopez-Poveda and Meddis (2001) was adopted, rather than the implementation used by Paulick et al. (2025). The key difference lies in the number of cascaded gammatone and low-pass filters in the linear and nonlinear paths. In the modified linear path, three gammatone filters and four low-pass filters were applied (compared to two gammatone and four low-pass filters in Paulick et al., 2025), while the nonlinear path comprised a cascade of three gammatone and three low-pass filters (compared to two and one, respectively, in Paulick et al., 2025). The DRNL implementation of Paulick et al. (2025) was aligned with that used in the speech-based sCASP model (Relaño-Iborra et al., 2019). A comparative study of auditory models demonstrated that this version produced comparatively broad filters (Osses Vecchi et al., 2022). By reverting to the original DRNL parametrization (Lopez-Poveda & Meddis, 2001), we aimed to restore sharper frequency selectivity, which proved necessary to capture the trends in the STM data. In addition, one simplification was introduced: the center frequencies and cutoffs of the gammatone and low-pass filters were fixed to the characteristic frequency, independent of level. Unlike the original formulation, no level-dependent frequency shifts with increasing sound pressure level were implemented here. This choice reflects more recent findings that challenge the notion of characteristic-frequency shifts at higher levels (Moore & Glasberg, 2003).

Modulation Phase Sensitivity

The implementation of the modulation phase-sensitivity stage was slightly modified compared to earlier work, while the underlying rationale remained unchanged, namely to reduce phase information in the internal representation above 10 Hz. In the present study, the real output of the modulation filterbank was first obtained. For filters with center frequencies at and above 10 Hz, the Hilbert envelope of this real output was then calculated and subsequently low-pass filtered with a 10 Hz cutoff, while preserving the overall energy at the filterbank output. For filters centered below 10 Hz, the real filter output was used directly. This procedure produced a stronger suppression of phase information at high modulation frequencies than the earlier implementation. In previous work, phase sensitivity had instead been reduced by taking the real part of the modulation filter output for $f_{\mathrm{mod}} \leq 10$ Hz and the absolute value for $f_{\mathrm{mod}} > 10$ Hz.
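The processing of a single modulation-filter output in the revised stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the CASP code: the function names are hypothetical, the low-pass filter is a simple first-order recursive filter (the filter order is not specified here), and "preserving the overall energy" is interpreted as rescaling the result so that its sum of squares matches that of the real filter output.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal of a real 1-D array, computed with the FFT."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def one_pole_lowpass(x, cutoff_hz, fs):
    """First-order recursive low-pass filter (assumed filter type)."""
    a = np.exp(-2.0 * np.pi * cutoff_hz / fs)
    y = np.empty(len(x))
    state = 0.0
    for i, sample in enumerate(x):
        state = a * state + (1.0 - a) * sample
        y[i] = state
    return y

def reduce_phase_information(real_output, center_hz, fs, cutoff_hz=10.0):
    """Hypothetical sketch of the phase-sensitivity stage for one filter."""
    real_output = np.asarray(real_output, dtype=float)
    if center_hz < cutoff_hz:
        # Filters centered below the cutoff keep the real output unchanged.
        return real_output.copy()
    # Hilbert envelope of the real output, then low-pass filtering.
    envelope = np.abs(analytic_signal(real_output))
    smoothed = one_pole_lowpass(envelope, cutoff_hz, fs)
    # Rescale so the output energy equals the input energy (assumption).
    energy_in = np.sum(real_output ** 2)
    energy_out = np.sum(smoothed ** 2)
    if energy_out == 0.0:
        return smoothed
    return smoothed * np.sqrt(energy_in / energy_out)

# Usage: a 180-degree phase shift is invisible at 12 Hz but preserved at 4 Hz.
fs = 1000.0
t = np.arange(1000) / fs
tone12 = np.sin(2.0 * np.pi * 12.0 * t)
tone4 = np.sin(2.0 * np.pi * 4.0 * t)
high_a = reduce_phase_information(tone12, 12.0, fs)
high_b = reduce_phase_information(-tone12, 12.0, fs)
low_a = reduce_phase_information(tone4, 4.0, fs)
low_b = reduce_phase_information(-tone4, 4.0, fs)
```

In this sketch, inverting the 12 Hz input leaves the output unchanged, whereas inverting the 4 Hz input inverts the output, which mirrors the intended loss of phase information at and above the cutoff.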

Back-End Stage

In the model backend, a supra-threshold template was generated prior to each run and correlated with the three alternative intervals across all dimensions of the internal representations. The correlation was computed as follows:

$$ r = \frac{1}{f_{s}} \sum_{m} \sum_{n \in \Omega_{m}} \sum_{t} X_{t n m} \cdot Y_{t n m}, \tag{5} $$

where $f_{s}$ is the sampling frequency, $X = [T \times N \times M]$ is the interval signal, and $Y = [T \times N \times M]$ is the template. The indices $t$, $n$, and $m$ refer to time, auditory channel, and modulation channel, respectively, while $\Omega_{m}$ denotes the subset of auditory channels selected by the channel-selection algorithm of Paulick et al. (2025), and constrained such that $f_{n} > 4 f_{m}$ for each modulation frequency $f_{m}$. This nonnormalized correlation approach differs slightly from Paulick et al. (2025), where the average over time and frequency was subtracted from both the signal and the template before computing the correlation. Due to the re-implementation of the modulation phase sensitivity for modulation frequencies above 10 Hz, where essentially only the DC component (modulation power) remains, a normalized correlation approach was not feasible.
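The nonnormalized correlation of equation (5) can be sketched in a few lines. The function names and the Boolean mask layout below are assumptions made for illustration. The channel-selection algorithm of Paulick et al. (2025) is not reproduced; the mask helper only encodes the stated constraint that the auditory-channel center frequency must exceed four times the modulation frequency.

```python
import numpy as np

def frequency_constraint_mask(audio_cf_hz, mod_cf_hz):
    """Boolean (N, M) mask that is True where f_n > 4 * f_m."""
    audio = np.asarray(audio_cf_hz, dtype=float)[:, None]
    mod = np.asarray(mod_cf_hz, dtype=float)[None, :]
    return audio > 4.0 * mod

def template_correlation(interval, template, fs, channel_mask):
    """Nonnormalized correlation between interval and template.

    interval, template: arrays of shape (T, N, M)
    fs: sampling frequency of the internal representation
    channel_mask: Boolean array of shape (N, M) marking the channels in use
    """
    interval = np.asarray(interval, dtype=float)
    template = np.asarray(template, dtype=float)
    if interval.shape != template.shape:
        raise ValueError("interval and template must have the same shape")
    # Sum the element-wise product over time for every (n, m) pair.
    per_channel = np.sum(interval * template, axis=0)
    # Keep only the selected channels, sum them, and scale by 1 / fs.
    return float(np.sum(per_channel[channel_mask]) / fs)

# Usage with toy values (hypothetical center frequencies and all-ones signals).
mask = frequency_constraint_mask([100.0, 1000.0], [4.0, 64.0])
X = np.ones((4, 2, 2))
Y = np.ones((4, 2, 2))
r = template_correlation(X, Y, fs=2.0, channel_mask=mask)
```

Because no mean is subtracted and no normalization is applied, a representation that consists mostly of a constant (DC) component still yields a nonzero correlation value.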

Backward Compatibility

The backward compatibility of these adjustments was evaluated by rerunning the benchmark experiments reported by Paulick et al. (2025), including intensity discrimination, forward masking, and modulation detection for NH listeners. Table 2 presents the mean absolute error and Pearson correlation with human data for each task, comparing the predictions of the previous model with those obtained using the modified implementation described above. Overall, similar performance of the two model implementations can be observed across the different experimental conditions.

Table 2.

Backward Compatibility Assessment: MAE and Pearson Correlation ( $r$ ) Comparing the Previous CASP Implementation (Paulick et al., 2025) and the Modified Implementation Across Benchmark Experiments.

Experiment	Condition	MAE (Paulick et al., 2025)	MAE (proposed)	Pearson $r$ (Paulick et al., 2025)	Pearson $r$ (proposed)
Intensity discrimination	Tone	0.27 dB	0.20 dB	0.807	0.835
Intensity discrimination	Broadband noise	0.67 dB	0.10 dB	0.224	$-0.334$
Forward masking	40 dB	3.47 dB	5.00 dB	0.984	0.992
Forward masking	60 dB	2.24 dB	5.25 dB	0.991	0.999
Forward masking	80 dB	1.70 dB	4.94 dB	0.996	0.995
Modulation detection	3 Hz	6.14 dB	8.79 dB	0.981	0.939
Modulation detection	31 Hz	3.42 dB	2.02 dB	0.798	0.846
Modulation detection	314 Hz	2.21 dB	1.65 dB	0.495	0.953

Abbreviations: MAE = mean absolute error; CASP = Computational Auditory Signal Processing and Perception model.
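For reference, the two summary metrics reported in Table 2 can be computed as in the sketch below. The function names are hypothetical and the example numbers are toy values, not data from the benchmark experiments.

```python
import numpy as np

def mean_absolute_error(predicted, measured):
    """Mean absolute difference between predictions and data (same units)."""
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    return float(np.mean(np.abs(predicted - measured)))

def pearson_r(predicted, measured):
    """Pearson correlation between predictions and data."""
    return float(np.corrcoef(predicted, measured)[0, 1])

# Toy thresholds in dB (hypothetical values).
toy_predicted = [1.0, 2.0, 3.0]
toy_measured = [2.0, 2.0, 5.0]
toy_mae = mean_absolute_error(toy_predicted, toy_measured)
toy_r = pearson_r(toy_predicted, toy_measured)
```

The two metrics are complementary: MAE reflects absolute offsets between predictions and data, whereas Pearson's $r$ only reflects how well the pattern across conditions is followed.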

Footnotes

Acknowledgments

We would like to thank Jonathan Regev for valuable help with the experimental design and setup and providing code for the ACALOS experiment, as well as Johannes Zaar for providing base code for generating the STM stimuli. This work was carried out in connection to the Center for Applied Hearing Research (CAHR) supported by WSA, Oticon, GN Hearing, and the Technical University of Denmark.

ORCID iDs

Lily Cassandra Paulick

Torsten Dau

Helia Relaño-Iborra

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by the Centre for Applied Hearing Research, supported by WSA, Oticon, GN Hearing and the Technical University of Denmark.

Declaration of Conflicting Interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The data supporting the findings of this study are openly available in the repository “Dataset for: ‘Predicting spectro-temporal modulation detection thresholds with a functional auditory model’” in DTU Data at https://doi.org/10.11583/DTU.31079305 (Paulick et al., 2026). The CASP model implementation and simulations for the tasks presented are available at https://gitlab.com/lpau/casp_forafc.git (Paulick et al., 2024).

References

Bernstein

J. G. W.

Danielsson

Hällgren

Stenfelt

Rö Nnberg

Lunner

(2016). Spectrotemporal modulation sensitivity as a predictor of speech-reception performance in noise with hearing aids. Trends in Hearing, 20, 1–17. https://doi.org/10.1177/2331216516670387

Bernstein

J. G. W.

Mehraei

Shamma

Gallun

F. J.

Theodoroff

S. M.

Leek

M. R.

(2013). Spectrotemporal modulation sensitivity as a predictor of speech intelligibility for hearing-impaired listeners. Journal of the American Academy of Audiology, 24(4), 293–306. https://doi.org/10.3766/jaaa.24.4.5

Brand

Hohmann

(2002). An adaptive procedure for categorical loudness scaling. The Journal of the Acoustical Society of America, 112(4), 1597–1604. https://doi.org/10.1121/1.1502902

Buss

Hall

J. W.

Grose

J. H.

(2004). Temporal fine-structure cues to speech and pure tone modulation in observers with sensorineural hearing loss. Ear and Hearing, 25(3), 242–250. https://doi.org/10.1097/01.AUD.0000130796.73809.09

Carney

A. E.

Nelson

D. A.

(1983). An analysis of psychophysical tuning curves in normal and pathological ears. The Journal of the Acoustical Society of America, 73(1), 268–278. https://doi.org/10.1121/1.388860

Chabot-Leclerc

Jørgensen

Dau

(2014). The role of auditory spectro-temporal modulation filtering and the decision metric for speech intelligibility prediction. The Journal of the Acoustical Society of America, 135, 3502–3512. https://doi.org/10.1121/1.4873517

Chalupper

Fastl

(2002). Dynamic loudness model (DLM) for normal and hearing-impaired listeners. Acta Acustica United with Acustica, 88(3), 378–386.

Chi

Gao

Guyton

M. C.

Shamma

(1999). Spectro-temporal modulation transfer functions and speech intelligibility. Journal of the Acoustical Society of America, 106, 2719–2732. https://doi.org/10.1121/1.428100

Dau

(1996). Modeling auditory processing of amplitude modulation. PhD Thesis, Universität Oldenburg.

10.

Dau

Kollmeier

Kohlrausch

(1997). Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers. The Journal of the Acoustical Society of America, 102, 2892. https://doi.org/10.1121/1.420344

11.

Dau

Püschel

Kohlrausch

(1996). A quantitative model of the “effective” signal processing in the auditory system. I. Model structure. The Journal of the Acoustical Society of America, 99(6), 3615–3622. https://doi.org/10.1121/1.414959

12.

Dreschler

W. A.

Plomp

(1985). Relations between psychophysical data and speech perception for hearing-impaired subjects. II. The Journal of the Acoustical Society of America, 78(4), 1261–1270. https://doi.org/10.1121/1.392895

13.

Dubno

J. R.

Schaefer

A. B.

(1991). Frequency selectivity for hearing-impaired and broadband-noise-masked normal listeners. The Quarterly Journal of Experimental Psychology. A, Human Experimental Psychology, 43(3), 543–564. https://doi.org/10.1080/14640749108400986

14.

Eddins

D. A.

Bero

E. M.

(2007). Spectral modulation detection as a function of modulation frequency, carrier bandwidth, and carrier frequency region. The Journal of the Acoustical Society of America, 121(1), 363–372. https://doi.org/10.1121/1.2382347

15.

Elhilali

Chi

Shamma

S. A.

(2003). A spectro-temporal modulation index (STMI) for assessment of speech intelligibility. Speech Communication, 41(2-3), 331–348. https://doi.org/10.1016/S0167-6393(02)00134-6

16.

Ewert

S. D.

(2013). AFC – A modular framework for running psychoacoustic experiments and computational perception models. In: The international conference on acoustics AIA-DAGA (pp. 1326–1329). Merano, Italy.

17.

Festen

J. M.

Plomp

(1983). Relations between auditory functions in impaired hearing. The Journal of the Acoustical Society of America, 73(2), 652–662. https://doi.org/10.1121/1.388957

18.

Florentine

(1978). Psychoacoustical tuning curves and narrow-band masking in normal and impaired hearing. The Journal of the Acoustical Society of America, 63(S1), S44–S45. https://doi.org/10.1121/1.2016662

19.

Gabriel

K. J.

Koehnke

Colburn

H. S.

Colburn

H. S.

(1992). Frequency dependence of binaural performance in listeners with impaired binaural hearing. The Journal of the Acoustical Society of America, 91(1), 336–347. https://doi.org/10.1121/1.402776

20.

Glasberg

Moore

B. C. J.

(1989). Psychoacoustic abilities of subjects with unilateral and bilateral cochlear impairments and their relationship to the ability to understand speech. Technical report.

21.

Glasberg

B. R.

Moore

B. C. J.

(1990). Derivation of auditory filter shapes from notched-noise data. Hearing Research, 47, 103–138.

22.

Green

D. M.

(1986). ‘Frequency’ and the Detection of Spectral Shape Change. In Auditory frequency selectivity (pp. 351–359). Boston, MA: Springer US. https://doi.org/10.1007/978-1-4613-2247-4_38.

23.

Green

D. M.

Swets

J. A.

(1988). Signal Detection Theory and Psychophysics. Peninsula Pub.

24.

Hall

J. W.

Tyler

R. S.

Fernandes

M. A.

(1984). Factors influencing the masking level difference in cochlear hearing-impaired and normal-hearing listeners. Journal of Speech and Hearing Research, 27(1), 145–154. https://doi.org/10.1044/JSHR.2701.145

25.

Houtgast & Festen, J. M. (2008). On the auditory and cognitive functions that may explain an individual’s elevation of the speech reception threshold in noise. International Journal of Audiology, 47(6), 287–295. https://doi.org/10.1080/14992020802127109

26.

Jepsen, M. L., & Dau (2011). Characterizing auditory processing and perception in individual listeners with sensorineural hearing loss. The Journal of the Acoustical Society of America, 129(1), 262–281. https://doi.org/10.1121/1.3518768

27.

Jepsen, M. L., Ewert, S. D., & Dau (2008). A computational model of human auditory signal processing and perception. The Journal of the Acoustical Society of America, 124(1), 422–438. https://doi.org/10.1121/1.2924135

28.

Johannesen, P. T., Pérez-González, Kalluri, Blanco, J. L., & Lopez-Poveda, E. A. (2016). The influence of cochlear mechanical dysfunction, temporal processing deficits, and age on the intelligibility of audible speech in noise for hearing-impaired listeners. In Trends in hearing (Vol. 20). SAGE Publications Inc. https://doi.org/10.1177/2331216516641055

29.

Johannesen, P. T., Pérez-González, Lopez-Poveda, E. A., & Munoz-Lopez (2014). Across-frequency behavioral estimates of the contribution of inner and outer hair cell dysfunction to individualized audiometric loss. https://doi.org/10.3389/fnins.2014.00214

30.

Jürgens, Kollmeier, Brand, & Ewert, S. D. (2011). Assessment of auditory nonlinearity for listeners with different hearing losses using temporal masking and categorical loudness scaling. Hearing Research, 280(1-2), 177–191. https://doi.org/10.1016/j.heares.2011.05.016

31.

Levitt (1971). Transformed up-down methods in psychoacoustics. The Journal of the Acoustical Society of America, 49(2B), 467–477. https://doi.org/10.1121/1.1912375

32.

Lopez-Poveda, E. A. (2014). Why do I hear but not understand? Stochastic undersampling as a model of degraded neural encoding of speech. https://doi.org/10.3389/fnins.2014.00348

33.

Lopez-Poveda, E. A., & Johannesen, P. T. (2012). Behavioral estimates of the contribution of inner and outer hair cell dysfunction to individualized audiometric loss. Journal of the Association for Research in Otolaryngology, 13(4), 485–504. https://doi.org/10.1007/s10162-012-0327-2

34.

Lopez-Poveda, E. A., & Meddis (2001). A human nonlinear cochlear filterbank. The Journal of the Acoustical Society of America, 110(6), 3107–3118. https://doi.org/10.1121/1.1416197

35.

Lopez-Poveda, E. A., Plack, C. J., & Meddis (2003). Cochlear nonlinearity between 500 and 8000 Hz in listeners with normal hearing. Journal of the Acoustical Society of America, 113, 951–960. https://doi.org/10.1121/1.1534838

36.

Lorenzi, Gilbert, Carn, Garnier, & Moore, B. C. (2006). Speech perception problems of the hearing impaired reflect inability to use temporal fine structure.

37.

Magits, Moncada-Torres, Van Deun, Wouters, Van Wieringen, & Francart (2019). The effect of presentation level on spectrotemporal modulation detection. Hearing Research, 371, 11–18. https://doi.org/10.1016/j.heares.2018.10.017

38.

Mehraei, Gallun, F. J., Leek, M. R., & Bernstein, J. G. W. (2014). Spectrotemporal modulation sensitivity for hearing-impaired listeners: Dependence on carrier center frequency and the relationship to speech intelligibility. Journal of the Acoustical Society of America, 136, 301–316. https://doi.org/10.1121/1.4881918

39.

Moore, B. C., & Glasberg, B. R. (2003). Behavioural measurement of level-dependent shifts in the vibration pattern on the basilar membrane at 1 and 2 kHz. Hearing Research, 175(1-2), 66–74. https://doi.org/10.1016/S0378-5955(02)00711-6

40.

Moore, B. C. J., & Glasberg, B. R. (1997). A model of loudness perception applied to cochlear hearing loss. Auditory Neuroscience, 3, 289–311.

41.

Moore, B. C. J., & Glasberg, B. R. (2001). Temporal modulation transfer functions obtained using sinusoidal carriers with normally hearing and hearing-impaired listeners. The Journal of the Acoustical Society of America, 110(2), 1067–1073. https://doi.org/10.1121/1.1385177

42.

Moore, B. C. J., & Sek (1996). Detection of frequency modulation at low modulation rates: Evidence for a mechanism based on phase locking. The Journal of the Acoustical Society of America, 100(4), 2320–2331. https://doi.org/10.1121/1.417941

43.

Moore, B. C. J., & Skrodzka (2002). Detection of frequency modulation by hearing-impaired listeners: Effects of carrier frequency, modulation rate, and added amplitude modulation. The Journal of the Acoustical Society of America, 111(1), 327–335. https://doi.org/10.1121/1.1424871

44.

Osses Vecchi, Varnet, Carney, L. H., Dau, Bruce, I. C., Verhulst, & Majdak (2022). A comparative study of eight human auditory models of monaural processing. Acta Acustica, 6, 17. https://doi.org/10.1051/aacus/2022008

45.

Oxenham, A. J., & Moore, B. C. J. (1997). Modeling the effects of peripheral nonlinearity in listeners with normal and impaired hearing. Technical report.

46.

Patra, Mukesh, & Heinz, M. G. (2024). Characterizing inner-hair-cell specific dysfunction from spike-train-derived transduction functions using a phenomenological auditory-nerve model. The Journal of the Acoustical Society of America, 155(3_Supplement), A34–A34. https://doi.org/10.1121/10.0026695

47.

Paulick, L. C., Dau, & Relaño-Iborra (2026). Dataset for: “Predicting spectro-temporal modulation detection thresholds with a functional auditory model”. https://doi.org/10.11583/DTU.31079305

48.

Paulick, L. C., Relaño-Iborra, & Dau (2024). “The computational auditory signal processing and perception model (CASP): A revised model implementation”. Technical University of Denmark. Model. https://doi.org/10.11583/DTU.26413378.v1

49.

Paulick, L. C., Relaño-Iborra, & Dau (2025). The computational auditory signal processing and perception model: A revised version. The Journal of the Acoustical Society of America, 157(5), 3232–3244. https://doi.org/10.1121/10.0036535

50.

Plack, C. J., Drga, & Lopez-Poveda, E. A. (2004). Inferred basilar-membrane response functions for listeners with mild to moderate sensorineural hearing loss. The Journal of the Acoustical Society of America, 115(4), 1684–1695. https://doi.org/10.1121/1.1675812

51.

Plack, C. J., & Oxenham, A. J. (1998). Basilar-membrane nonlinearity and the growth of forward masking. The Journal of the Acoustical Society of America, 103(3), 1598–1608. https://doi.org/10.1121/1.421294

52.

Plack, C. J., Oxenham, A. J., Simonson, A. M., O’Hanlon, C. G., Drga, & Arifianto (2008). Estimates of compression at low and high frequencies using masking additivity in normal and impaired ears. The Journal of the Acoustical Society of America, 123(6), 4321–4330. https://doi.org/10.1121/1.2908297

53.

Plomp (1978). Auditory handicap of hearing impairment and the limited benefit of hearing aids. Journal of the Acoustical Society of America, 63(2), 533–549. https://doi.org/10.1121/1.381753

54.

Püschel (1988). Prinzipien der zeitlichen Analyse beim Hören. PhD Thesis, University of Göttingen.

55.

Qin, M. K., & Oxenham, A. J. (2003). Effects of simulated cochlear-implant processing on speech reception in fluctuating maskers. The Journal of the Acoustical Society of America, 114(1), 446–454. https://doi.org/10.1121/1.1579009

56.

Regev, Relaño-Iborra, Zaar, & Dau (2024). Disentangling the effects of hearing loss and age on amplitude modulation frequency selectivity. The Journal of the Acoustical Society of America, 155(4), 2589–2602. https://doi.org/10.1121/10.0025541

57.

Regev, Zaar, Relaño-Iborra, & Dau (2023). Age-related reduction of amplitude modulation frequency selectivity. The Journal of the Acoustical Society of America, 153(4), 2298. https://doi.org/10.1121/10.0017835

58.

Relaño-Iborra, Zaar, & Dau (2019). A speech-based computational auditory signal processing and perception model. The Journal of the Acoustical Society of America, 146(5), 3306–3317. https://doi.org/10.1121/1.5129114

59.

Robles & Ruggero, M. A. (2001). Mechanics of the mammalian cochlea. Physiological Reviews, 81(3), 1305–1352. https://doi.org/10.1152/physrev.2001.81.3.1305

60.

Sanchez-Lopez, Fereczkowski, Neher, Santurette, & Dau (2020). Robust data-driven auditory profiling towards precision audiology. Trends in Hearing, 24, 1–19. https://doi.org/10.1177/2331216520973539

61.

Schlittenlacher & Moore, B. C. J. (2016). Discrimination of amplitude-modulation depth by subjects with normal and impaired hearing. The Journal of the Acoustical Society of America, 140(5), 3487–3495. https://doi.org/10.1121/1.4966117

62.

Sheft & Yost, W. A. (2007). Discrimination of starting phase with sinusoidal envelope modulation. The Journal of the Acoustical Society of America, 121(2), EL84–EL89. https://doi.org/10.1121/1.2430766

63.

Singh, N. C., & Theunissen, F. E. (2003). Modulation spectra of natural sounds and ethological theories of auditory processing. Journal of the Acoustical Society of America, 114, 3394–3411. https://doi.org/10.1121/1.1624067

64.

Stefan, Volker, & Birger (2019). Modeling loudness growth and loudness summation in hearing-impaired listeners. In Modeling sensorineural hearing loss (pp. 175–185). Routledge. https://doi.org/10.4324/9781315789392-14

65.

Strelcyk & Dau (2009). Relations between frequency selectivity, temporal fine-structure processing, and speech reception in impaired hearing. The Journal of the Acoustical Society of America, 125(5), 3328. https://doi.org/10.1121/1.3097469

66.

Thorup, Santurette, Jørgensen, Kjaerbøl, Dau, & Friis (2016). Auditory profiling and hearing-aid satisfaction in hearing-aid candidates. Danish Medical Journal, 63(10), 1–5.

67.

van Schijndel, N. H., Houtgast, & Festen, J. M. (2001). Effects of degradation of intensity, time, or frequency content on speech intelligibility for normal-hearing and hearing-impaired listeners. The Journal of the Acoustical Society of America, 110(1), 529–542. https://doi.org/10.1121/1.1378345

68.

Venezia, J. H., Martin, A. G., Hickok, & Richards, V. M. (2019). Identification of the spectrotemporal modulations that support speech intelligibility in hearing-impaired and normal-hearing listeners. Journal of Speech, Language, and Hearing Research, 62(4), 1051–1067. https://doi.org/10.1044/2018_JSLHR-H-18-0045

69.

Viemeister, N. F. (1979). Temporal modulation transfer functions based upon modulation thresholds. Journal of the Acoustical Society of America, 66, 1364–1380. https://doi.org/10.1121/1.383531

70.

Wallaert, Moore, B. C. J., Ewert, S. D., & Lorenzi (2017). Sensorineural hearing loss enhances auditory sensitivity and temporal integration for amplitude modulation. The Journal of the Acoustical Society of America, 141(2), 971–980. https://doi.org/10.1121/1.4976080

71.

Wiinberg, Jepsen, M. L., Epp, & Dau (2019). Effects of hearing loss and fast-acting compression on amplitude modulation perception and speech intelligibility. Ear & Hearing, 40(1), 45–54. https://doi.org/10.1097/AUD.0000000000000589

72.

Zaar, Simonsen, L. B., Dau, & Laugesen (2023). Toward a clinically viable spectro-temporal modulation test for predicting supra-threshold speech reception in hearing-impaired listeners. Hearing Research, 427, 108650. https://doi.org/10.1016/j.heares.2022.108650

73.

Zaar, Simonsen, L. B., & Laugesen (2024). A spectro-temporal modulation test for predicting speech reception in hearing-impaired listeners with hearing aids. Hearing Research, 443, 108949. https://doi.org/10.1016/J.HEARES.2024.108949
