Abstract
During continuous speech perception, endogenous neural activity becomes time-locked to acoustic stimulus features, such as the speech amplitude envelope. This speech–brain coupling can be decoded using non-invasive brain imaging techniques, including electroencephalography (EEG). Neural decoding may provide clinical use as an objective measure of stimulus encoding by the brain—for example during cochlear implant listening, wherein the speech signal is severely spectrally degraded. Yet, interplay between acoustic and linguistic factors may lead to top-down modulation of perception, thereby complicating audiological applications. To address this ambiguity, we assess neural decoding of the speech envelope under spectral degradation with EEG in acoustically hearing listeners (n = 38; 18–35 years old) using vocoded speech. We dissociate sensory encoding from higher-order processing by employing intelligible (English) and non-intelligible (Dutch) stimuli, with auditory attention sustained using a repeated-phrase detection task. Subject-specific and group decoders were trained to reconstruct the speech envelope from held-out EEG data, with decoder significance determined via random permutation testing. Whereas speech envelope reconstruction did not vary by spectral resolution, intelligible speech was associated with better decoding accuracy in general. Results were similar across subject-specific and group analyses, with less consistent effects of spectral degradation in group decoding. Permutation tests revealed possible differences in decoder statistical significance by experimental condition. In general, while robust neural decoding was observed at the individual and group level, variability within participants would most likely prevent the clinical use of such a measure to differentiate levels of spectral degradation and intelligibility on an individual basis.
Introduction
Auditory speech perception is made possible by the continuous interaction between cognitive processes, including learning, memory, and motor knowledge acquired from speech production, and sensory encoding of the acoustic signal at hierarchically organized levels (Davis & Johnsrude, 2007). Despite this complexity, the neural signature of some aspects of speech perception can be detected using noninvasive brain imaging methods, such as electroencephalography (EEG). Whereas these tools have largely been used to examine the brain response to short, isolated speech sounds, such as syllables or words, listeners’ neural activity also covaries with continuous speech features, including the amplitude envelope or phonetic annotations, across time, thereby enabling the use of engaging, ecologically relevant stimuli (e.g., audiobook recordings; Di Liberto et al., 2015; Ding & Simon, 2014; Golumbic et al., 2013; Lalor & Foxe, 2010). This speech–brain correspondence can be recovered using both linear and nonlinear methods, such as cross-correlation (Aiken & Picton, 2008; Aljarboa et al., 2023), mutual information (Cogan & Poeppel, 2011; De Clercq et al., 2023), and regression analysis (Di Liberto et al., 2015). A key advantage to these approaches is that—unlike traditional, event-related paradigms—they allow researchers to employ naturalistic stimuli capable of reflecting the rich spectro-temporal structure and variability present in every-day speech. Such stimuli are likely to engage, and indeed, require, the recruitment of cognitive processes across all levels of speech perception.
Quantifying this form of stimulus–brain coupling over time can shed light on topics as diverse as speech-in-noise perception (McHaney et al., 2021), selective auditory attention (Baltzell et al., 2016), audiovisual integration (Crosse et al., 2015), and infant language acquisition (Barajas et al., 2021). In addition to answering basic science questions about speech perception, there is increasing interest in neural speech tracking as a potential tool for clinicians (Gillis et al., 2022). Such methods have been proposed as objective measures to assess auditory speech processing (Alikovic et al., 2021; Geirnaert et al., 2021; Gillis et al., 2021; Vanthornhout et al., 2018). These techniques provide a means to directly link behavior with physiology, as the same stimuli and listening paradigms can be used to collect measures from each domain. Still, a detailed understanding of the underlying mechanisms, their reflection in neural response measurements, and potential interactions is required before clinical application. Neural decoding model weights are non-causal and should not be used to draw inferences about the underlying spatial–temporal patterns of brain activity associated with speech perception (Kriegeskorte & Douglas, 2019). They are, however, ideal for clinically oriented research, as decoders are not computationally constrained, do not require many parameters to be manually selected—such as the pre-identification of channels or sensors of interest—and provide a simple, univariate measure of accuracy. Hence, in the current study, we focus on the neural decoding of speech.
Interplay Between Sensory and Cognitive Mechanisms During Speech Perception: The Role of Speech Intelligibility
Speech perception emerges at the interface between sensory processing and cognition; that is, the experience of hearing speech results from the active interpretation and representation, rather than deterministic or passive translation, of acoustic signals as higher-order linguistic constructs (Davis & Johnsrude, 2007; Heald & Nusbaum, 2014; Holt & Lotto, 2010). This complexity is reflected by the physiological organization of the auditory system, which contains numerous and substantial bidirectional connections between cortical and sub-cortical regions (Asilador & Llano, 2021; Terreros & Delano, 2015). In practice, such recurrence in the flow of information across and between levels of the auditory hierarchy suggests that even early sensory processing is affected by computations performed at the cortex (Lesicko & Llano, 2017; Price & Bidelman, 2021). When decoding neural speech tracking, it is therefore possible that contextual factors associated with higher-order processing, such as semantic prediction, may modulate the sensory response to lower-order stimulus features, such as the amplitude envelope. In fact, previous studies reported that neural tracking of acoustic properties was directly affected by manipulations of speech intelligibility, and also significantly correlated with behaviorally measured comprehension in typically hearing adult and children listeners (Iotzov & Parra, 2019; Peelle et al., 2013; Van Hirtum et al., 2023; Vanthornhout et al., 2018), as well as cochlear implant (CI) listeners (Nogueira & Dolhopiatenko, 2022; Verschueren et al., 2019, 2021). Many of these studies did not manipulate speech intelligibility independently, but instead varied speech-to-noise ratios or speech intensity levels as proxies. This approach may lead to changes in physical aspects of the signal, such as reductions in spectral contrast or audibility, which could alter the neural response irrespective of intelligibility per se. Instead, another option for manipulating speech intelligibility is to use unfamiliar natural languages. For example, Etard and Reichenbach (2019) presented English-speaking listeners with stimuli in both English and Dutch, reporting significantly greater neural decoding for English when the speech was heard in noise. However, the authors also found that neural decoding was, overall, strongest for speech without noise, in which case, the magnitude of tracking did not differ between English and Dutch (Etard & Reichenbach, 2019). One recent study presented Dutch and Frisian speech to Dutch-speaking listeners and found similar neural decoding in both languages, with the caveat that some participants reported being able to understand the Frisian stimuli (Gillis et al., 2023). These results are further complicated by a separate, but related strand of research that compares speech tracking across the adult language learning process. Such work suggests that the magnitude of speech–neural coupling during second language (L2) listening becomes greater with increasing L2 proficiency (Lizarazu et al., 2021; Zinszer et al., 2022); however, higher L2 neural decoding may also reflect increased attention and/or effortful listening (Reetzke et al., 2021; Song & Iverson, 2018). Indeed, attention and effortful listening may play a role in many studies that manipulate intelligibility: Participants’ capacity and willingness for sustained, engaged listening is likely to depend on whether they can understand the speech or not. For instance, task-directed attention boosts speech–brain coupling specifically at the timescale of complete phrases (Sokoliuk et al., 2021). Simultaneously watching a silent, unrelated cartoon diminishes auditory speech tracking in noise (Vanthornhout et al., 2019). Selective auditory attention paradigms also reveal that the target among competing speakers can be identified based on neural decoding alone (Baltzell et al., 2016; Biesmans et al., 2016; O'Sullivan et al., 2015), due to a proportionately stronger neural response to the focal stimulus (Ding & Simon, 2012; Golumbic et al., 2013; Rimmele et al., 2015). Higher listening effort, in response to degraded or challenging listening conditions, also appears to enhance speech tracking, such as during reverberant speech perception (Ershaid et al., 2023; Fuglsang et al., 2017). In short, successful reconstruction of acoustic features from neural data relies on sensory and cognitive processes originating across brain networks and regions. Hence, the role of speech intelligibility, and its entanglement with attention and listening effort, needs to be considered for neural speech decoding as an objective measure of auditory speech processing (Lesenfants & Francart, 2020).
Neural Decoding in the Context of CI Listening
One population that may benefit from clinical applications of neural decoding are CI recipients. CIs confer a sense of hearing via electrical stimulation of the recipient's auditory nerve. Over a million CIs have been implanted globally, with patients ranging from infants to the elderly (Zeng, 2022). Typically, CI device fitting is based on subjective patient reports and simple audiometric measures in a time-constrained environment, which may be insufficient for assessing the transmission quality of speech signals. Furthermore, many CI recipients do not experience good speech reception right after switch-on, or have limited verbal communication skills (e.g., infants), impeding their ability to describe the speech transmission quality of their device. After undergoing a time-intensive fitting process in the clinic, it can take weeks or months for an individual to adjust to their CI settings. Unfortunately, it is possible that the CI device was in fact not optimally set up for that patient, and by this point, maladaptive listening strategies may have already become established. Speech benefits should be maximized for every CI recipient, and may crucially depend on the acoustic quality of the perceived speech from day one of CI activation. An efficient and objective measure of naturalistic speech as received by CI recipients would, therefore, present a helpful tool for the device-fitting process in clinics, to assess speech perception outcomes over time, and for the development of improved sound processing strategies in future CI devices. For example, although CIs often can restore speech understanding to a high degree, large inter-patient variability remains (Boisvert et al., 2020; Carlyon & Goehring, 2021). Biophysical factors, such as hearing loss etiology and surgical placement of the CI array, have been shown to explain some of this variability, in addition to complications arising at the electrode–neural interface, such as interactions between stimulating electrodes and auditory nerve fibers (Goehring et al., 2020, 2021). However, it remains unclear why some CI recipients achieve very good speech perception, whereas others do not, and why such large differences in speech reception improvement occur (Holden et al., 2013). Ultimately, the interplay between sensory and cognitive processes discussed in the previous section apply to speech perception with CIs; and given the sparsity and distortions inherent to the perceived CI speech signal, it is likely that central factors are important to speech outcomes in this population (Heydebrand et al., 2007).
Neural speech tracking has been previously measured in CI listeners to examine its relationship with speech reception skills (Verschueren et al., 2019, 2021) and to identify the locus of auditory attention (Dolhopiatenko & Nogueira, 2023; Nogueira & Dolhopiatenko, 2022). A major complication when decoding speech from the brain of CI listeners is the electrical stimulation by the CI itself, which is registered by the EEG recording system concurrently with brain-originating activity, thereby confounding the observed neural representation of the speech envelope. To avoid electrical contamination entirely, researchers may instead simulate CI listening in acoustically hearing listeners using vocoded speech. Vocoder-based CI simulations are inherently limited and cannot account for all aspects of CI hearing (Svirsky et al., 2021), but do enable full experimental control over the perceived speech signal, including its spectral resolution and degradation (Cychosz et al., 2023). A further advantage is that neural responses to CI-simulated speech can be compared to neural responses to unprocessed speech within the same listener as reference.
Disentangling Intelligibility from Spectral Degradation Effects
Previously, we used vocoding to investigate the consequences of spectral degradation on neural decoding of the speech amplitude envelope in 20 typically hearing listeners using EEG (MacIntyre & Goehring, 2023). Despite relatively severe spectral degradation, we found no statistical evidence for an effect on neural decoding, but only included intelligible speech conditions. Had we introduced acoustically similar, but non-intelligible speech, listeners might not have benefited from cognitive processes related to intelligibility such as semantic prediction. In other words, although we had determined neural tracking of intelligible speech to be robust to spectral degradation, non-intelligible speech tracking could diminish as the signal becomes increasingly spectrally degraded.
The current EEG study is, therefore, motivated to disentangle speech clarity and comprehension as potential factors underlying the success of neural decoding. To achieve this, we independently manipulate speech intelligibility and spectral degradation, taking care that acoustic, linguistic, textual, and speaker-specific differences, which may have been confounded with intelligibility in previous work, are minimized here. For non-intelligible listening conditions, we opted to present our English-speaking participants with natural, Dutch language stimuli (Christophe & Morton, 1998; Ramus et al., 2003). To produce similar English and Dutch speech stimuli, we selected a common text source material and recruited an individual, early bilingual speaker to record in both languages, thereby holding speaker-specific characteristics constant. To manipulate acoustic clarity, in addition to natural (unprocessed) speech, we introduce two levels of spectral degradation with vocoded speech and vocoded speech with simulated current spread as it occurs with CIs (Grange et al., 2017). Note that, to address the issue of waning attention during non-intelligible speech perception, we introduce a prosody-focused auditory target detection task that can be performed with or without speech understanding.
This Study: Aims and Hypotheses
Our primary aim is to disentangle intelligibility from spectral degradation effects during the neural decoding of speech. To confirm its use for clinical applications, we conduct our EEG analyses using both subject-specific and group-level neural decoding, illuminating what is captured in common by these different approaches. We also investigate the generalizability of decoding by training and testing within and across intelligibility and spectral degradation conditions, similarly to other studies that trained and tested decoders using differing stimulus materials and listening conditions (e.g., Corcoran et al., 2023; Gillis et al., 2021). Finally, it may be clinically important to establish that an individual patient's neural tracking is decoded significantly above chance, rather than taking the absolute magnitude of decoding accuracy as the objective measure of interest, to account for the substantial variability in neural decoding measures. Here, we employ random permutation testing to describe the proportion of decoders that reach conventional thresholds of statistical significance.
Building on the mixed results of the studies most similar to ours, we propose two hypotheses. For the first hypothesis, and in line with reports of an association between comprehension and speech–brain coupling (Iotzov & Parra, 2019; Peelle et al., 2013; Vanthornhout et al., 2018), we predict that speech tracking is generally better for intelligible (English) in comparison to non-intelligible (Dutch) speech. Previous reports suggest that decreasing acoustic quality may in fact enhance neural tracking of intelligible speech (Hauswald et al., 2022; Yasmin et al., 2023); however, as we saw no evidence for a group-level effect of spectral degradation in our previous work (MacIntyre & Goehring, 2023), we do not expect to find one in the current study. In the second hypothesis, we still expect that intelligibility is associated with stronger speech–brain coupling, but only as an interaction effect with spectral degradation. Specifically, whereas English speech tracking remains robust, Dutch speech tracking diminishes, with declining acoustic clarity. This result would cohere with previous works where a language-based difference was only observed for speech in noise (Etard & Reichenbach, 2019) and no difference was found between two languages presented as unprocessed speech only (Gillis et al., 2023).
Materials and Methods
Participants
A group of 38 adults (22 female and 16 male; aged 18–35, mean = 24.95, SD = 5.09; six self-reported as left-handed) with self-reported typical hearing took part. The target sample size was determined to roughly double that of previous EEG studies reporting intelligibility effects (e.g., Etard & Reichenbach, 2019; Lesenfants et al., 2019). Participants were recruited from the departmental volunteer database, and comprised university students as well as members of the broader local community. Participants reported no history of audiological nor speech and language-related disorders, and spoke English as a primary language with little to no experience of Dutch. Limited exposure to other related languages, namely Afrikaans and German, was reported by six participants, but these individuals verbally confirmed that they could not understand the Dutch stimuli (self-rated “Ability to Follow” Median = 1.75, IQR = 1.00, scale 1–7). All participants provided informed, written consent and received £12.50 per hour for taking part. The study received approval from the Cambridge Psychology Research Ethics Committee and was conducted in accordance with the Declaration of Helsinki.
Stimuli
Recording, Preprocessing, and Editing of Audio
The participants were presented with an audio recording of the Sherlock Holmes story “The Adventure of Charles Augustus Milverton” by Arthur Conan Doyle (Doyle et al., 1903), which is also available as a Dutch translation (Doyle, 1903). Only one of the participants reported having encountered the story previously. For both languages, the text was lightly edited for clarity and modernization. In Dutch, the names of characters and places were changed (e.g., “Holmes” to “Haas”). An early bilingual, adult male speaker was recruited to perform both versions of the story. The speaker spoke American English at school and Dutch in the home during his childhood. The English story was 33 min 47 s, divided into six trials (mean = duration 5 min 38 s, SD = 2.07 s) and the Dutch story was 35 min 54 s, also divided into six trials (mean = duration 5 min 59 s, SD = 4.23 s).
Recordings were sampled at 44.1 kHz. Intensity was normalized by root mean square using a sliding window and corrected by ear for a naturalistic narrative experience; the integrated loudness was −19.90 loudness unit full scale (LUFS) for the English story and −20.03 LUFS for the Dutch version. Audio was filtered to conform to the international long-term average speech spectrum (Byrne et al., 1994; Figure 1, panel A). Long pauses (>500 ms) were manually restricted to 500 ms. Syllabic timing was estimated by windowing the speech recordings into 2 s chunks and automatically detecting vowel onset-like acoustic landmarks (MacIntyre et al., 2022). The grand average inter-vowel onset interval was, for English, 203.47 ms (SD 47.11 ms) and for Dutch, 202.68 ms (SD 38.96 ms). The grand average coefficient of variation for inter-vowel onset intervals was 0.56 (SD 0.23) for English and 0.53 (SD 0.21) for Dutch. Hence, the English and Dutch speech stimuli were well-matched for overall syllable rate and regularity (Figure 1, panels C and D).

Comparisons of English and Dutch speech stimuli. Panel A: Spectral power analysis of the English and Dutch versions of the story. Panel B: Speech amplitude envelopes extracted from unprocessed, vocoded, and vocoded + blurring listening conditions. Panel C: Histograms depicting the distributions of mean inter-vowel intervals (calculated over a 2 s-duration window) in the English and Dutch versions of the story. Panel D: Histograms depicting the distributions of coefficient of variation of inter-vowel intervals (calculated over a 2 s-duration window) in the English and Dutch versions of the story.
Spectral Degradation Conditions
There were three spectral degradation conditions. The first consisted of natural or unprocessed speech (“unprocessed”). To produce the second spectral degradation condition (“vocoded”), we employed the SPIRAL vocoder (Grange et al., 2017). This technique was specifically designed to model CI listening by emulating the physiology of the spiral ganglion and interactions at the electrode–neural interface (Grange et al., 2017). In particular, SPIRAL accounts for the channel interaction mechanism, by which spectral resolution is further degraded with CIs due to overlap in neural excitation, effectively “blurring” the incoming signal (Fu & Nogaki, 2005; Goehring et al., 2020, 2021). Channel interaction also affects temporal speech cues via the smoothing of envelope fluctuations across channels (Oxenham & Kreft, 2014). Behavioral experiments indicate that SPIRAL-based simulations can match typical CI speech reception performance in background noise (Fletcher et al., 2018, 2019; Goehring et al., 2019). We used 16 analysis filter bands, resulting in spectrally degraded but intelligible speech. In the third spectral degradation condition (“vocoded + blurring”), we additionally simulate the deleterious effects of current spread when vocoding the speech by implementing a decay slope of −16 dB/octave. This value is proposed to best match CI speech reception performance based on behavioral data (Grange et al., 2017). As naive listeners rapidly learn to understand vocoded speech (Hervais-Adelman et al., 2011; Samuel & Kraljic, 2009), we address perceptual learning effects over the course of the experiment by counter-balancing the six combinations of language and spectral degradation condition. Both the English and Dutch stories were presented chronologically (i.e., the narrative structure was preserved), alternating trials between languages. With two trials per listening condition, this resulted in 12 unique condition orders.
Behavioral Measures
Repeated-Phrase Detection Task
The behavioral listening task was designed to sustain auditory attention throughout the experiment, even when the speech stimulus was unintelligible to participants. It consisted of a repeated-phrase detection paradigm. Specifically, the audio would repeat over the course of one short phrase, resulting in one additional occurrence of the phrase after first presentation, after which point the story continued. Such a task avoids the introduction of non-speech target sounds (e.g., tones or noise bursts), or interspersing recognizable English target words within Dutch speech, which may not require active attention to register. Repetitions occurred pseudo-randomly once every ∼45 s and each trial contained eight repeated phrases in total. The mean repeated-phrase duration was 2.06 s (SD 0.04 s). This value was determined via online piloting (n = 20 English speakers, aged 18–65), which showed that the task became more difficult for Dutch stimuli as the phrase length exceeded 3 s. Before beginning the experiment, participants were given a practice session with feedback automatically provided on accuracy and reaction times. There were six practice trials (duration range 7–9 s), with one trial for each experimental listening condition. Performance in the practice session was monitored by the experimenter to ensure the task was understood. Target detection was recorded via button press using a custom-made USB button box that the participants held in one hand on their lap.
Self-Reported Engagement and Ability to Follow
At the end of each trial, participants were asked to rate the preceding story section in response to two questions using a seven-point Likert scale. The questions were “How engaging did you find the story?” and “How well could you follow the story?”, and the order of question presentation alternated by trial. A rating of “1” meant that the participant did not feel at all engaged by the story or that the participant could not follow the story content at all. A rating of “7” meant that the participant felt highly engaged by the story or that they could easily understand the story content. “Engagement” was explained to the participant as feeling actively interested in the plot and/or outcome of the story. “Ability to follow” was explained as comprehension of the plot and dialogue. Participants’ responses were recorded using the left and right arrow keys of a USB keyboard.
Preprocessing of Behavioral Data
To quantify performance in the repeated-phrase target detection task, hit and false alarm rates were measured at the trial level. Reaction times were also aggregated at the trial level by taking the median and inter-quartile range. For all behavioral measures (i.e., hit and false alarm rates, median reaction times, and self-report ratings), we took forward the mean value of the two trials associated with each condition for statistical analysis.
EEG Acquisition and Preprocessing
The experiment was carried out in an acoustically and electromagnetically shielded booth. 64-channel EEG was digitized at 2048 Hz using a BioSemi ActiveTwo EEG system. Scalp electrodes were placed according to the International 10–20 system. Stimuli were presented binaurally via an RME Fireface UCX (RME, Haimhausen, Germany) external soundcard and ER-2 insert earphones (Etymotic Research, Elk Grove Village, Illinois). Presentation levels were held constant across all participants and throughout experimental sessions, with the experimenter verbally confirming with each individual that the stimulus intensity was clearly audible and comfortable. To establish a common time series, EEG triggers were controlled using an additional channel of the soundcard. Participants interacted with the experiment using a custom GUI programmed in Psychtoolbox (Kleiner et al., 2007). While listening to speech stimuli, they were asked to refrain from unnecessary movement and to focus their eyes on a fixation cross (Thaler et al., 2013). Participants had the opportunity to take a short, self-paced break for blinking and stretching after each trial. At the experiment halfway point, the participant was invited to take a longer break along with the offer of a drink and snack. Each experimental session, including EEG preparation and debriefing, lasted approximately 2–2.5 h.
All preprocessing was conducted using custom routines programmed in MATLAB 2020b (MathWorks Inc., Natick, Massachusetts) with functions from the Fieldtrip (Oostenveld et al., 2011) and noisetools (de Cheveigné & Arzounian, 2018) toolboxes. The EEG data were re-referenced to the channel average and resampled at 256 Hz for computational efficiency. Zapline-plus was applied to remove residual line noise (Klug & Kloosterman, 2022). The data were highpass filtered at 0.5 Hz and then lowpass filtered at 40 Hz using fourth-order Butterworth filters. Noisy channels were visually identified and excluded. Eye blink artifacts were removed using independent component analysis (runica algorithm). Previously identified noisy channels were then interpolated using a weighted neighbors approach. The first and final 5 s of each trial were trimmed to avoid stimulus onset-related responses and filtering-related artifacts. Both the EEG and acoustic envelope signals were resampled to 100 Hz in order to compute the decoders.
Measure of Cortical Speech Tracking
The acoustic stimulus envelope was extracted by passing the speech signal through a Gammatone filter-bank (28 filters equally spaced on the equivalent rectangular bandwidth scale between 50 and 5000 Hz); rectifying and applying a power law to each sub-band output; and then recombining the resultant envelopes by taking their average (Biesmans et al., 2016). Finally, the envelope was lowpass filtered at 10 Hz using an eighth-order Butterworth filter to form the broadband envelope. A linear comparison of envelopes across the three levels of spectral degradation revealed that all pairwise correlations were very strong (r > 0.985, p < .001).
To assess neural tracking of the speech envelope, we trained linear decoders using the multivariate temporal response function (mTRF) toolbox (Crosse et al., 2016). The mTRF toolbox applies regularized ridge regression to obtain a quantitative mapping between stimulus features—here, the speech amplitude envelope—and the recorded neural response. The objective is to reconstruct or predict the input stimulus from EEG (Crosse et al., 2016, 2021). Within this framework, the decoding model g (τ, n) returns the stimulus s^(t) as a linear convolution of the neural response r (t, n) over a range of time lags τ, or
When considering practical applications for example in a clinical context, it is currently unknown whether it would be preferable to build decoder models on a patient-by-patient basis or whether a generic decoder model should be used for assessing speech envelope decoding outcomes, and whether these two approaches would produce different results. We therefore trained two types of decoders: subject-specific models using data from individual participants and group-level models using data pooled across participants. We trained the decoders by minimizing the mean-squared error using k-fold, leave-one-out cross-validation (Crosse et al., 2016, 2021). In the case of the participant-specific decoders, we split the ∼10 min of training data into four folds; for the group decoders, we trained on n–1 × ∼10 min folds. The time lags encompassed the range from −50 to 400 ms post-stimulus. The optimized ridge parameter λ, which controls the regularization penalty, was empirically determined using the same format of k-fold cross-validation within logarithmically spaced values ranging between 10−7 and 107. As our goal was to maximize decoding accuracy with a view to clinical applications, an individual λ value was chosen for each model; however, optimal λ was 100 for the majority (30/38) of subject-specific decoders (Range=[10, 1000]), and 1000 for all group-level decoders.
Statistical Analysis
To estimate the effects of intelligibility and spectral degradation on speech envelope reconstruction accuracy whilst accounting for differences across participants, we conducted linear mixed effect modelling in R 4.0.3 (R Core Team, 2021) with lme4 (Bates et al., 2014). These models were estimated using restricted maximum likelihood and optimx optimiser. When modelling neural decoding accuracy as an r-value, the primary predictors of interest were the factors language (trained and test) and spectral degradation (trained and test). Random intercepts and slopes within participants were fit subject to model convergence. The reference levels were trained and test on English, unprocessed speech. Model residuals were inspected using visualizations and diagnostic tests of normality from the DHARMa package (Hartig, 2017). The selection of fixed effect terms was guided by comparing models with the Akaike information criterion (AIC), a metric that balances the model's goodness of fit against its parsimony (Burnham et al., 2011). The significance of fixed effects was tested using Type II Wald F tests with Kenward–Roger degrees of freedom (Luke, 2017) using the pbkrtest package (Halekoh & Højsgaard, 2014), and R2sp values (calculated with r2glmm package) are provided as a measure of variance explained by each predictor (Jaeger et al., 2017). Planned and posthoc comparisons were conducted using estimated marginal means implemented in easystats package (Lüdecke et al., 2022). For correlational analyses, continuous variables of interest included hit rate, false alarm rate, and mean median reaction time in the repeated-phrase detection task, as well as the ordinal variables of self-reported engagement and ability to follow.
Bonferroni-adjusted p-values are reported where relevant. In-text confidence intervals are obtained by using bootstrapped sampling with replacement (n = 1000 iterations).
Establishing Decoder Significance
The primary outcome variable is decoder accuracy as an r-value, following convention. However, the individual significance level of each decoder, relative to an empirically determined null distribution, may also be informative. To obtain decoder significance, we first generated a null distribution by segmenting the training speech envelope into ∼3 s chunks, shuffling the segments (prohibiting any coincidental true pairings), and re-concatenating them. Segmentation was applied at signal zero-crossings and the segments were smoothly joined using interpolation, thereby avoiding the introduction of artifacts to the permuted envelope. We then trained and tested null decoders, within-listening condition, using this shuffled stimulus and the unmodified neural response 1000 times. When the observed r-value exceeds the 95th percentile of the null distribution, the decoder can be considered as significant at p < .05. We describe the frequency and variability of decoder significance by listening condition.
Results
Behavioral Measures
Self-Reported Ability to Follow and Engagement
Summary descriptive statistics for the behavioral results by listening condition are shown in Table 1 and Figure 2. Averaging across levels of acoustic clarity, participants rated their Ability to Follow the English story as Median = 6.00 (IQR = 2.0), and the Dutch story, Median = 1.00 (IQR = 0.50) on a scale from 1 to 7. Self-reported engagement was high in English, with even the least engaging listening condition, vocoded + blurring, receiving a median score of 5.50 (IQR = 3.00). For Dutch, however, the overall median engagement score was just 1.50 (IQR = 1.00) on a scale from 1 to 7.

Panel A: Box plots and group mean with bootstrap 95% confidence intervals depicting self-reported ability to follow the story. Panel B: Box plots and group mean with bootstrap 95% confidence intervals depicting self-reported engagement with the story. Panel C: Dot-and-whisker plot depicting standardized estimates (regression coefficients) with 95% confidence intervals from the linear mixed effects model of mean median reaction times in the repeated-phrase target detection task. The model reference levels were English unprocessed. Panel D: Box plots and group mean with bootstrap 95% confidence intervals depicting mean median reaction time by listening condition. Significance values reflect the outcome of pairwise tests of estimated marginal means between levels of spectral degradation with Bonferroni correction.
Summary Statistics of Behavioral Data for the Repeated-Phrase Detection Task and the Participants’ Self-Reported Measures of Ability to Follow the Story in Terms of Plot and Dialogue, and Engagement, Referring to Their Sense of Absorption or Interest in the Outcome of the Story.
Repeated-Phrase Detection Task
Accuracy in auditory target detection ranged from a mean hit rate of 0.89 (SD 0.16) and a mean false alarm rate of 0.09 (SD 0.13) for Dutch vocoded + blurring, to a mean hit rate of 1.00 (SD 0.05) and a mean false alarm rate of 0.02 (SD 0.03) for English unprocessed. Hence, task performance was generally good across all listening conditions. The linear mixed effect model analysis of reaction time data (Figure 2, panel C) consisted of a main effect of language, F(1.00,37.00) = 42.25, p < .001, R2sp = 0.53, 95% CI [0.34,0.71], and of spectral degradation, F(2.00,150.00) = 45.58, p < .001, R2sp = 0.38, 95% CI [0.27,0.49]. Including an interaction term between language and spectral degradation did not improve model fit (Supplemental Appendix A, Table A1).
Reaction times to Dutch targets (mean = 1.26 s, SD = 0.29 s) were slower than to English targets (mean = 1.10 s, SD = 0.27 s), estimate = 0.16, 95% CI [0.11, 0.21], t(37.00) = 6.50, p < .001. With respect to spectral degradation, in comparison to unprocessed speech (mean = 1.10 s, SD = 0.25 s), vocoded speech (mean = 1.15 s, SD = 0.27 s) evoked slower reaction times, estimate = 0.05, 95% CI [0.01, 0.10], t(150.00) = 2.43, p = .016. Vocoded + blurring (mean = 1.30 s, SD = 0.32 s) was also associated with slower reaction times in comparison to unprocessed speech, estimate = 0.20, 95% CI [0.16, 0.24], t(150.00) = 9.21, p < .001. Finally, vocoded + blurring reaction times were also slower than those evoked by vocoded speech, estimate = −0.15, 95% CI [−0.20, −0.09], t(150.00) = −6.78, pbonf < .001. Standardized estimates and additional model details are provided in Supplemental Appendix A, Tables A2−A3).
In summary, reaction times to Dutch and spectrally degraded repeated-phrase targets tend to be slower than responses to English unprocessed targets. However, there does not appear to be an interaction between language and spectral degradation. Linking response speed to the self-report measures, we found that participants with faster reaction times to auditory targets on average tended to have worse self-described ability to follow, rs = −0.58, 95% CI [−0.73, −0.17], p < .001; as well as lower levels of engagement, rs = −0.43, 95% CI [−0.67, −0.21], p = .007.
Neural Decoding of the Speech Signal
Subject-Specific Decoding Accuracy
Each subject-specific decoder was trained on 10 min of data from a single listening condition, and tested on 1 min of held-out test data from each of the six listening conditions. The linear mixed effect model of subject-specific decoding accuracy (Figure 3) included main effects of trained: spectral degradation, F(2.00,1245.00) = 9.42, p < .001, R2sp = 0.01, 95% CI [0.01,0.03], and test: spectral degradation, F(2.00,37.00) = 4.16, p = .023, R2sp = 0.18, 95% CI [0.04,0.44], as well as their interaction, F(4.00,1245.00) = 41.74, p < .001, R2sp = 0.12, 95% CI [0.09,0.15]. In addition, the model also included a non-significant predictor of trained: language, F(1.00,1245.00) = 3.45, p = .064, R2sp = 0.00, 95% CI [0.00,0.01], and a main effect of test: language, F(1.00,1245.00) = 55.29, p < .001, R2sp = 0.04, 95% CI [0.02,0.07], as well as an interaction between trained: language and test, F(1.00,1245.00) = 26.34, p < .001, R2sp = 0.02, 95% CI [0.01,0.04]. AIC indicated that including interaction terms between the spectral degradation- and language-based predictors did not improve model fit (Supplemental Appendix A, Table A4). Full details are reported in Supplemental Appendix A, Table A5.

Subject-specific speech decoding accuracy. Panel A: Dot-and-whisker plot depicting standardized estimates (regression coefficients) with 95% confidence intervals from the linear mixed effects model of subject-specific speech decoding accuracy. The model reference levels were English unprocessed (trained and test), with red colors depicting negative estimates, and blue, positive estimates. Panel B: Box plots and group mean with bootstrap 95% confidence intervals depicting speech decoding accuracy by trained: spectral degradation and test: spectral degradation. Data are collapsed across language conditions. Panel C: Box plots and group mean with bootstrap 95% confidence intervals depicting speech decoding accuracy by trained: language and test: language. Data are collapsed across spectral degradation conditions. Significance values reflect the outcome of pairwise tests of estimated marginal means with Bonferroni correction.
Spectral Degradation
To investigate the interaction term between trained: spectral degradation and test: spectral degradation, we conducted pairwise comparisons of the estimated marginal means (Figure 3, panel B; Supplemental Appendix A, Table A6). In general, the more similar trained: spectral degradation was to test: spectral degradation, the more decoding accuracy improved. For example, if the decoder was trained: unprocessed, then test: unprocessed (mean = 0.15, SD = 0.08) was better decoded than vocoded + blurring (mean = 0.12, SD = 0.07), pbonf = 0.007, d = 0.46, 95% CI [0.16, 0.75]. An analogous, but reverse pattern was observed for trained: vocoded + blurring decoders, such that test: vocoded + blurring (mean = 0.16, SD = 0.06) was decoded more accurately than unprocessed (mean = 0.12, SD = 0.06), pbonf < .001, d = −0.64, 95% CI [−0.94, −0.33]. Decoders that were trained: unprocessed or vocoded + blurring could each decode test: vocoded speech as well as the listening condition they had been trained on (pbonf ≥ .478). However, when the decoder was itself trained: vocoded, then test: vocoded elicited higher decoding accuracy (mean = 0.17, SD = 0.08) than both unprocessed (mean = 0.15, SD = 0.07), pbonf < .001, d = −0.54, 95% CI [−0.82, −0.25]; as well as vocoded + blurring (mean = 0.15, SD = 0.07), pbonf = 0.036, d = 0.35, 95% CI [0.08, 0.61].
Language
We next examined the interaction between trained: language and test: language (Figure 3, panel C; Supplemental Appendix A, Table A7). When trained: English, decoding accuracy was higher for test: English (mean = 0.16, SD = 0.07) in comparison to Dutch (mean = 0.13, SD = 0.06), pbonf < .001, d = 0.25, 95% CI [0.20, 0.31]. Decoders that were trained: Dutch, however, did not distinguish between test: English (mean = 0.14, SD = 0.07) or Dutch (mean = 0.14, SD = 0.07), pbonf = 0.104, d = 0.05, 95% CI [−0.01, 0.10]. Thus, contrary to English, there was no apparent advantage for testing on Dutch when having trained on Dutch. Although the effect of test: language was modest, it was relatively consistent across participants.
Correlations Between Brain and Behavior
In the following analysis, trained and test listening conditions were always matched. As group performance in the repeated-phrase detection task was close to the ceiling, it was impractical to correlate hit or false alarm rate with neural decoding accuracy. The only listening condition with some variance in target detection success was Dutch vocoded + blurring; however, we observed no trend linking speech tracking to hit rate, p = .616, nor false alarm rate, p = .532. With regards to reaction times, slower response speed was associated with better neural decoding accuracy in English unprocessed, r = 0.38, 95% CI [0.09, 0.60], p = .019, but not for any other listening condition (p ≥ 0.242). The wide confidence intervals of the estimate suggest that, if truly present, this correlation would be subject to inter-individual variability. Turning to self-reported ability to follow, we correlated this measure with neural decoding in English vocoded + blurring–the only intelligible condition with between-participant differences—and found no obvious association, p = .504. However, participants differed in their use of the Likert scale, with some participants using its full range across experimental conditions, and others not. To account for this, we subtracted ability to follow in English vocoded + blurring from English unprocessed. This gives an individualized measure of relative change in comprehension. We also performed the same operation for decoding accuracy, to produce a measure of relative change in speech tracking. At the group level, there was a correlation between change in ability to follow and change in neural tracking, r = 0.35, 95% CI [0.02, 0.60], p = .030. In other words, participants who rated their ability to follow as relatively poor for English vocoded + blurring were also more likely to have relatively worse speech tracking in that condition, in comparison to English unprocessed.
Whereas this result coheres with proposals that EEG can predict intelligibility (e.g., Iotzov & Parra, 2019; Vanthornhout et al., 2018), we cannot be sure that changes in decoding accuracy are solely attributable to a loss in intelligibility and were not influenced by acoustic differences. To tease apart these factors, we took advantage of the Dutch decoding data, by subtracting neural decoding in Dutch vocoded + blurring from unprocessed. This third variable provides an independent estimate of individual response to spectral degradation that is not expected to vary with intelligibility. We used it as a covariate to calculate the partial correlation between change in ability to follow and change in neural tracking in English, showing that r = 0.29, 95% CI [−0.11, 0.60], p = .077. Thus, the association between self-reported comprehension and neural decoding does appear to be mediated by spectral degradation, although the wide confidence intervals of the effect suggest considerable inter-individual variability. As engagement closely tracked with ability to follow, we did not examine it further.
Group Decoding
Group-level decoding was performed similarly to subject-specific decoding, except instead of training the decoder on a single participant, we used data from n −1 participants and held out the nth participant's data for testing. This procedure was repeated for each of the 38 participants. The linear mixed effects model (Figure 4) of group decoding accuracy contained a main effect of trained: spectral degradation, F(2.00,1247.00) = 7.07, p < .001, R2sp = 0.01, 95% CI [0.00,0.03], and non-significant predictor of test: spectral degradation, F(2.00,37.00) = 2.21, p = .124, R2sp = 0.11, 95% CI [0.01,0.35]; in addition, there was an interaction between trained: spectral degradation and test: spectral degradation, F(4.00,1247.00) = 17.70, p < .001, R2sp = 0.05, 95% CI [0.03,0.08]. Unlike the subject-specific decoders, AIC indicated that the predictor trained: language did not improve model fit, neither as a main effect nor interaction term (Supplemental Appendix A, Table A8). There was, however, a main effect of test: language, F(1.00,1247.00) = 49.19, p < .001, R2sp = 0.04, 95% CI [0.02,0.06]. Full details are provided in Supplemental Appendix A, Table A9.

Same as Figure 3 but for group speech decoding accuracy.
Spectral Degradation
The pattern of the interaction between trained: spectral degradation and test: spectral degradation was similar to subject-specific decoding. However, negative effects of mismatching trained and test conditions were less consistent (Figure 4, panel B). For example, when the decoder was trained: unprocessed, the statistical effect of test: unprocessed (mean = 0.11, SD = 0.06) was only marginally better than vocoded + blurring (mean = 0.08, SD = 0.06), pbonf = .053, d = 0.35, 95% CI [0.06, 0.64]. In addition, when the decoder was trained: vocoded + blurring, then test: vocoded + blurring (mean = 0.11, SD = 0.06) was decoded with only-just significantly higher accuracy than unprocessed (mean = 0.08, SD = 0.06), pbonf = .047, d = −0.36, 95% CI [−0.65, −0.07]. Finally, at the group level, trained: vocoded decoders achieved comparable reconstruction whether test: unprocessed, vocoded, or vocoded + blurring. Importantly, for all pairwise contrasts, the wide intervals of the estimates and effect sizes indicate substantial variability in the neural response to spectral degradation between participants. Full details of the tests of estimated marginal means are given in Supplemental Appendix A, Table A10.
Language
As indicated by the main effect, test: English (mean = 0.11, SD = 0.07) was decoded with higher accuracy than Dutch (mean = 0.09, SD = 0.06), estimate = 0.02, 95% CI [0.01, 0.02], t(1247.00) = 7.01, pbonf < .001, Cohen's d = 0.20, 95% CI [0.14, 0.25], regardless of the training language. As with the subject-specific decoding, the test: language effect was small in magnitude, but more consistent across participants when compared to the effects of spectral degradation.
Correlations Between Brain and Behavior
Similarly to the subject-specific results, in the Dutch vocoded + blurring listening condition, there was no clear linear relationship between speech tracking and hit rate (p = .154) nor false alarm rate (p = .109). The positive association with slower reaction times in English unprocessed was present, but numerically smaller than for subject-specific decoding, r = 0.33, 95% CI [0.01, 0.04], p = .041; all other p ≥ .063). Unlike the subject-specific analysis, however, change in self-reported ability to follow and change in decoding accuracy did not significantly correlate, r = 0.26, 95% CI [−0.01, 0.64], p = .121.
Generalizability of Neural Decoding
Subject-Specific and Group Decoding
To quantify the similarity between subject-specific and group-level decoding performance across listening conditions, we calculated Kendall's concordance coefficient, W, within participants. This was performed using data where the decoder was trained and tested within the same listening condition. For 30 of 38 participants, agreement between their subject-specific and group-level decoding results was W > 0.70. The median value was 0.79 (IQR = 0.20, range = [0.29, 0.97]), which indicates good-to-very-good agreement between subject-specific and group-level decoding accuracy within participants. Although Kendall's W shows high consistency in terms of the direction or ranking of results across listening condition, we were also interested in absolute differences in accuracy between subject-specific and group-level decoding. We, therefore, subtracted each group decoder's accuracy from each subject-specific decoder's accuracy, within participants and by listening condition, before aggregating at the participant level. The group mean difference—a loss—in accuracy was 0.06 (SD = 0.04). We can scale this difference by dividing it by the subject-specific decoding value (within participant), resulting in the group mean = 0.31, SD = 0.20. This can also be described as an average reduction in decoding accuracy of 30% from the subject-specific value (Figure 5, top panel).

Histograms depicting the scaled difference in neural decoding accuracy when subtracting group-level from subject-specific values (panel A), or when subtracting non-matched from matched training and testing listening conditions (panels B and C). Scaling is performed by dividing the resulting differences by the subject-specific (panel A) or matched training and testing (panels B and C) decoding accuracy values.
Testing and Training Across Listening Conditions
We also examined the global effects of language and spectral degradation when trained and test listening conditions are mismatched. Namely, within participant, we subtracted the mismatched (“across”) decoding value from the matched (“within”) decoding value. First, to examine the language contrast, we pooled subject-specific and group decoding, as well as all three levels of spectral degradation. The mean loss in accuracy due to training and testing across language was 0.01 (SD = 0.01), or when scaled by the within-language accuracy, mean = 0.07, SD = 0.10 (Figure 5, middle panel). Next, to look at the spectral degradation contrast, we pooled subject-specific and group decoding, as well as language. In this case, the mean loss in accuracy was 0.02 (SD = 0.01), or when scaled, mean = 0.16, SD = 0.08 (Figure 5, bottom panel). A paired t-test of the scaled differences confirmed that the proportionate effects of training and testing across listening condition are greater for spectral degradation than for language, t(37) = −5.84, p < .001, d = 0.96, 95% CI [−1.34, −0.57].
Significance of Decoders
Under clinical settings, it may be useful to select a cut-off value at which point a patient's neural tracking can be said to be statistically significant. We, therefore, performed a descriptive analysis of decoder significance with a focus on outcomes for individual participants, rather than group-level effects. Decoder statistical significance was established by comparing observed accuracy (i.e., the r-value) ranked against the accuracy of decoders trained on randomly permuted versions of the same stimulus speech amplitude envelopes (n = 1000 permutations). To quantify the variability of the percentile rank value itself, we calculated median absolute deviation (MAD) at the group level. MAD is a measure of spread analogous to standard deviation, but is more robust to outliers and appropriate for characterizing non-Gaussian distributions. Hence, we use MAD here to estimate the consistency, across participants, of the decoder percentile rank relative to its empirically determined, null distribution.
Subject-Specific Decoding
Figure 6, panel A depicts the proportion of subject-specific decoders that met typically accepted levels of statistical significance. English unprocessed speech was the only listening condition where subject-specific decoding at p < .05 occurred for every participant. Moreover, MAD = 0.65 percentiles (95% CI [0.36, 1.14]), indicating little variability in the rank of observed decoder accuracy across participants. For English vocoded speech, however, MAD = 5.05 percentiles (95% CI [0.93, 12.99]); and for English vocoded + blurring, MAD = 5.94 percentiles (95% CI [2.08, 14.01]). These larger values of MAD and wider confidence intervals may indicate that, in the current sample, increasing spectral degradation resulted in less consistent significance testing: In particular, for English, the proportion of decoders significant at p < .001 grew from 58% in unprocessed speech to 71% for vocoded + blurring, but so did the proportion of decoders that failed to reach significance at p < .05, from 0% to 11%, respectively. Turning to Dutch unprocessed, 21% of decoders (n = 8) were non-significant at p < .05 in the current sample. The percentile rank of Dutch unprocessed accuracy was also estimated to be the most variable across all the listening conditions, MAD = 9.67 percentiles (95% CI [4.65, 16.43]). Moreover, opposite to the pattern observed for English decoding, variability in significance testing decreased for Dutch vocoded, MAD = 3.93 percentiles (95% CI [1.44, 10.70]) and Dutch vocoded + blurring, MAD = 1.34 percentiles (95% CI [0.50, 2.77]).

Bar plots indicating the significance level, determined via random permutation testing with n = 1000 iterations, at which neural decoding was decoded across listening conditions by subject-specific decoders (panel A) and group decoders (panel B). Trained and test data were from matching listening conditions.
Group Decoding
The proportions of group decoders to achieve conventional levels of significance were roughly similar across both English and Dutch, with ≥ 28% of decoders failing to surpass the p < .05 cut-off across listening conditions (Figure 6, panel B). In other words, about one third of group decoders may not meet a clinically acceptable threshold of statistical significance in the current sample. With respect to the variability of percentile rank across decoders, MAD was smallest, or least dispersed, for unprocessed speech, with roughly overlapping confidence intervals across English, MAD = 6.06 (95% CI [3.20, 12.05]) and Dutch, MAD = 6.81 (95% CI [4.16, 11.63]). This level of variability was also comparable for Dutch vocoded, MAD = 7.44 (95% CI [3.52, 12.44]) and Dutch vocoded + blurring (MAD = 8.08, 95% CI [5.12, 13.23]). We observed possibly higher MAD with wider confidence intervals for English vocoded, MAD = 10.52 (95% CI [5.23, 18.47]), as well as English vocoded + blurring (MAD = 12.12, 95% CI [6.20, 21.49]). Hence, like the subject-specific decoding, we find numerical differences in the frequency and variability of decoding significance levels achieved by listening condition. However, unlike the subject-specific decoding, the broad pattern of differences is shared across English and Dutch speech.
Discussion
Intelligibility Boosts Speech Envelope Reconstruction Accuracy, but the Effect is Small
A central question in speech perception research is what neural tracking of the amplitude envelope can disclose about comprehension (Ding & Simon, 2014). For the purposes of clinically assessing auditory speech processing, an equally pertinent question is whether comprehension contributes to neural tracking (our first hypothesis). This is relevant where auditory input is likely to be severely spectrally degraded, as would be the case for CI patients. For example, an individual with relatively poor spectral resolution might nonetheless show good neural speech tracking under optimal listening conditions. Speech intelligibility could, therefore, obscure real problems occurring early in the auditory processing hierarchy, such as issues at the electrode–neural interface, or sub-optimal device configuration. Conversely, if accurate decoding is dependent on speech comprehension, this would prevent its use as an objective measure of auditory encoding in many clinical populations. Hence, it is crucial to understand the relationship between speech stimulus clarity and intelligibility as one of many endogenously originating factors that could shape early sensory processing. In the current study, we address this challenge by presenting participants with intelligible and non-intelligible speech at varying levels of spectral degradation. We attempted to minimize the influence of stimulus properties that were previously confounded with intelligibility, including acoustic properties, speaker and voice characteristics, and auditory attention. We, moreover, quantify the generalizability of speech decoding in the context of our experimental manipulations, as well as across subject-specific versus group-level reconstruction techniques. Our findings cohere with an emerging consensus that envelope tracking is possible without speech understanding (Baltzell et al., 2017; Gillis et al., 2023; Karunathilake et al., 2023; Kösem et al., 2023), yet intelligibility nonetheless exerts an enhancing effect on decoding accuracy in the current dataset and thereby supports our first hypothesis.
Our linear mixed modelling results suggest the effect of intelligibility is consistent, but small. Importantly, we found no interaction between spectral degradation and intelligibility on speech tracking and thereby no support for our second hypothesis, although key questions remain. One relates to the methodology itself, namely, the issue of decoder significance. Although our analysis and hypotheses focused on decoding accuracy across participants, a clinician may be more interested in decoder performance at the patient level. Our descriptive analysis showed that the rank of the observed decoder accuracy relative to its null distribution was inconsistent across decoding methods and listening conditions. Future work with inferential statistics should confirm our observations here, but it is potentially problematic that only English unprocessed speech was decoded significantly for all participants. Given that the permuted feature, the amplitude envelope, is largely unchanged by spectral degradation and appears to share basic temporal properties across English and Dutch, fluctuations in the false-positive rate (i.e., where a randomly permuted envelope is decoded with higher accuracy than the true stimulus envelope) are presumably attributable to differences in the listener's neural response. It is possible that some brain states are more amenable to spurious correlations with the speech envelope: We can envision this possibly occurring due to enhanced oscillatory activity at the syllable rate, as one hypothetical example.
Another unresolved question is whether individual measures of comprehension can be predicted by speech envelope decoding. We observed a tenuous link between self-reported comprehension and subject-specific neural decoding, but its magnitude was reduced when accounting for the brain response to spectral degradation, independent of understanding. One limitation in our study is that we did not collect any objective measure of speech comprehension, such as asking participants about the narrative content of the story. This decision was made partly in the interest of time, and partly because an equivalent measure was not available for Dutch listening conditions. It is possible that participants’ self-described ability to follow is less informative than comprehension questions or word reports. Another consideration is that, although English and Dutch are mutually unintelligible, they do share many suprasegmental properties in common. As we wished to minimize acoustic and linguistic differences between intelligible and non-intelligible conditions, this was a deliberate choice for our experimental design. Prosody, although not necessarily lexically informative, nonetheless conveys salient information, ranging from the rhythm and structure of words and phrases, to emotional affect, to the pacing of action and dialogue. Hence, although listeners in our study reported very little understanding or sense of engagement relating to the Dutch stimuli, our decoding results probably still capture language- and speech-specific neural responses beyond strictly acoustic processing (Myers et al., 2019).
Equivalent Envelope Reconstruction of Unprocessed and Spectrally Degraded Speech
Based on the literature, we formed different hypotheses for spectral degradation in the current study. In one scenario, the neural decoding of Dutch speech worsens with declining acoustic clarity, possibly because listeners are unable to benefit from semantic prediction or other top-down processes. As it turned out, at the group level, speech tracking level was similar across unprocessed, vocoded, and vocoded speech with additional blurring distortion. Given that the broadband amplitude envelope was more or less identical across spectral degradation conditions, our result speaks to the dominance of this acoustic feature in the speech-evoked EEG (Ding & Simon, 2014). For example, Prinsloo and Lalor (2022) showed that, even for artificial speech stimuli where the envelope conveys no meaningful information, it evokes a stronger cortical response in listeners than competing acoustic features that do facilitate comprehension.
One aspect in which our study differs from others is that we did not manipulate spectral resolution to the point where speech could no longer be understood at all. This preserves the orthogonality between acoustic quality and speech understanding, but challenges straightforward comparison to previous results. For instance, whereas some authors reported no differences in the neural response to unprocessed and four-channel vocoded speech (Dong & Gai, 2021), other groups did see decoding accuracy decrease when speech was vocoded with five or fewer channels (Chen et al., 2023). Hence, speech may not have been spectrally degraded enough in the current experiment to elicit differences in decoding accuracy, assuming it is indeed spectral resolution and not intelligibility that impacted stimulus reconstruction (Chen et al., 2023). It is, however, possible that we would observe spatial and temporal differences when characterizing the EEG response itself. This idea is supported by evidence from studies that used encoders and other models wherein weights and time lags are representative of neural activation patterns, and can, therefore, be interpreted by researchers (Holdgraf et al., 2017), noting that these additional parameters—some of which are manually selected—could complicate their use in clinical settings. For example, behaviorally measured comprehension was the same for unprocessed and 16-channel vocoded speech, but the modelled neural response was more similar between 16-channel and 8-channel vocoded speech than between unprocessed and 16-channel vocoded speech (Kong et al., 2015). Another study employing invasive stereoelectroencephalography (sEEG) found an early high-gamma (60–140 Hz) power response, localized to primary auditory cortex, that responded preferentially to vocoded speech, and a later theta (5–8 Hz) response, localized to nonprimary auditory areas mainly in the right hemisphere, that responded preferentially to unprocessed speech (Xu et al., 2023). In addition to characterizing the temporal profile or waveform of the neural response, there are other measures that reveal differences between unprocessed and vocoded speech processing, such as speech–brain coherence, which describes the phase relationship between stimulus features and the neural response. Studies employing this technique suggest that vocoding speech shifts the coherence centre frequency from the syllabic rate (∼4 Hz) towards the slightly faster acoustic modulation rate (∼5 Hz; Chen et al., 2023; Schmidt et al., 2021). Together, these findings imply the possibility of distinct auditory processing pathways for unprocessed and vocoded speech, despite their equivalent amenability to neural decoding. In particular, Xu et al. (2023) argued that the sEEG data reveal a functional dissociation between automatic processing of low-level acoustic information, and a slightly later response that is selective for intelligible speech; however, as speech intelligibility was fully confounded with acoustic clarity in their study, it is unknown whether spatial–temporal changes in the neural response are the result of waning comprehension or increasing spectral degradation. A shift in perceptual attention from one putative system to the other could, in theory, contribute to the sudden “pop-out” listeners sometimes experience when hearing vocoded speech (Corcoran et al., 2023; Davis et al., 2005; Karunathilake et al., 2023; Kösem et al., 2023). Presumably, under ecologically relevant conditions, both the automatic acoustic and speech-selective systems would continuously interact at multiple stages of auditory processing. Of relevance to our results, it is possible that the early, acoustically driven response contributed to statistical significance detection in subject-specific decoding of Dutch speech, which tended to improve with decreasing spectral resolution.
Decoding Within and Across Individuals and Listening Conditions
By training and testing within and across listening conditions, we were able to quantify the generalizability of speech tracking. Although subject-specific and group-level decoding showed good within-individual correlation, group decoders produced comparably low accuracy values. They were also less discriminating with respect to listening condition, to the extent that there was a minimal effect of training the group decoders on English versus Dutch speech. Approximately one third of group decoders, furthermore, failed to meet statistical significance according to random permutation testing. When comparing decoding across the entire experiment (Figure 5), it can be seen that some individuals particularly benefit from subject-specific decoding, at least with regards to accuracy. Potential clinical implications, however, will also depend on the relative robustness of decoding—it is possible that subject-specific decoders are overfit—and the ability to capture specific aspects of the neural response that are relevant to speech outcomes. While these preliminary results do not support the clinical application of neural decoding to assess fine spectral detail or estimate speech comprehension, both subject- and group-level decoders showed reliable tracking for the majority of participants, and could therefore be used as a more fundamental measure of auditory speech processing.
As the acoustic quality of test speech becomes less similar to training speech, decoding accuracy wanes. This pattern is symmetrical across the three levels of spectral degradation, with no apparent advantage unique to training and testing on unprocessed speech. That result stands in contrast to intelligibility, where we do record consistently better decoding for the language that is understood, regardless of whether the model was trained on intelligible speech. There was, however, substantial variability in the neural response to spectral degradation between participants (Gransier et al., 2021). This is evidenced in the current data set by the wide confidence intervals for the model estimates and effect sizes of spectral degradation-related predictors. Given the individuality and diversity of neural responses observed in the present dataset, which was contributed by typically hearing younger adults, it seems unlikely that a single normative or “healthy” decoder would be suitable for clinical or non-neurotypical populations—at least for the purposes of assessing sensory-acoustic clarity in speech processing.
Our findings do underscore the potential clinical viability of subject-specific decoding, for which statistically significant decoders can be trained using ∼10 min of data per listening condition. Although the absolute magnitude of neural decoding accuracy is unlikely to be meaningful at the population level, we can envision a paradigm wherein speech or speech-like stimuli are manipulated on a specific acoustic dimension. Establishing the patient's individual threshold at which this manipulation can be detected in the EEG response could form the basis of an objective measure of auditory processing, for example (Goehring et al., 2020, 2021). Given the observed substantial variability across spectral degradation conditions within individuals, it seems unlikely that neural decoding can reveal differences in spectral resolution for individual CI listeners when predicting the speech amplitude envelope. Still, it may serve as a principal measure of speech envelope perception during naturalistic speech listening independent of speech understanding or spectral resolution and therefore be helpful in the CI device fitting process. CI speech processing relies fully on the successful transmission of speech envelope information across a limited number of frequency channels and an objective measure of this capacity could serve as a crucial tool to support the device fitting and validation process in clinics. It may be particularly useful for patients with limited verbal communication skills, for whom device fitting can be very challenging. The previously mentioned CI stimulation artifacts complicate EEG-based measurements in CI listeners, but methods to mitigate CI-related artifacts with the use of stimulation gaps (Somers et al., 2018), computational artifact removal (Intartaglia et al., 2022) or interleaved stimulation (Carlyon et al., 2021; Guérit et al., 2023) have recently been proposed and successfully applied to measure neural responses in CI listeners with EEG. A combination of artifact removal techniques with objective, EEG-based measures of speech transmission warrant further research but represent a promising venue to develop future clinical applications.
Conclusion
An objective measure of neural speech encoding function would benefit patients and clinicians, as well as researchers involved in hearing sciences or the cognitive neuroscience of speech. However, for an objective measure to be useful, it should also work for individuals for whom the speech signal is sparse or distorted, such as CI listeners. Thus far, efforts to link physiology with subjective and objective behavioral measures of speech perception have been limited by confounding factors, such as acoustic clarity and prior knowledge. In this study, we have attempted to control these and other dimensions that may covary with intelligibility. The results show that, at the sample level, the amplitude envelope is robustly represented within the neural response to speech, even when speech is not understood, and that this response can be detected in the context of severe spectral degradation. However, we also find limited evidence for a relationship between speech comprehension and neural decoding of the envelope, with slightly but significantly higher values in the intelligible language. We also observe substantial between-participant variation in the brain response to different listening conditions. It is possible that within-participant analyses, whether by permutation testing, or presenting listeners with parameterized stimuli, may yield greater insights into hearing health and speech reception than the interpretation of the absolute accuracy of neural decoding per se. We conclude that while more research is warranted to understand the sources of variability observed in these data, neural decoding of the speech envelope may serve as a basic objective measure of auditory speech processing during CI listening and should be further investigated.
Supplemental Material
sj-docx-1-tia-10.1177_23312165241266316 - Supplemental material for Neural Decoding of the Speech Envelope: Effects of Intelligibility and Spectral Degradation
Supplemental material, sj-docx-1-tia-10.1177_23312165241266316 for Neural Decoding of the Speech Envelope: Effects of Intelligibility and Spectral Degradation by Alexis Deighton MacIntyre, Robert P. Carlyon and Tobias Goehring in Trends in Hearing
Footnotes
Acknowledgments
This research was funded by the Wellcome Trust Institutional Strategic Support Fund (ADM) and the Medical Research Council UK Career Development Award Fellowship MR/T03095X/1 (TG). For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising.
Data Availability
The data and acoustic stimuli that support the findings of this study are openly available at DOI 10.17605/OSF.IO/CZWJ9.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Medical Research Council, Wellcome Trust (grant number MR/T03095X/1).
Supplemental Material
Supplemental material for this article is available online.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
