
Abstract

Auditory training can lead to notable enhancements in specific tasks, but whether these improvements generalize to untrained tasks like speech-in-noise (SIN) recognition remains uncertain. This study examined how training conditions affect generalization. Fifty-five young adults were divided into "Trained-in-Quiet" (n = 15), "Trained-in-Noise" (n = 20), and "Control" (n = 20) groups. Participants completed two sessions. The first session involved an assessment of SIN recognition and voice discrimination (VD) with word or sentence stimuli, employing combined fundamental frequency (F0) + formant frequencies voice cues. Subsequently, only the trained groups proceeded to an interleaved training phase, encompassing six VD blocks with sentence stimuli, utilizing either F0-only or formant-only cues. The second session replicated the interleaved training for the trained groups, followed by a second assessment conducted by all three groups, identical to the first session. Results showed significant improvements in the trained task regardless of training conditions. However, VD training with a single cue did not enhance VD with both cues beyond control group improvements, suggesting limited generalization. Notably, the Trained-in-Noise group exhibited the most significant SIN recognition improvements posttraining, implying generalization across tasks that share similar acoustic conditions. Overall, findings suggest training conditions impact generalization by influencing processing levels associated with the trained task. Training in noisy conditions may prompt higher auditory and/or cognitive processing than training in quiet, potentially extending skills to tasks involving challenging listening conditions, such as SIN recognition. These insights hold significant theoretical and clinical implications, potentially advancing the development of effective auditory training protocols.

Keywords

auditory training learning specificity voice cues speech in noise

Introduction

Perceptual training can improve speech recognition in noisy environments. It involves enduring changes within the perceptual system that enhance the response to stimuli and require adjustments in internal representations based on experience (Goldstone, 1998; Herszage & Censor, 2018; Irvine, 2018). However, while studies involving auditory or cognitive training typically show significant improvements in trained tasks, evidence for generalization of the learning gains to untrained tasks, such as speech-in-noise (SIN) recognition, is mixed (Henshaw & Ferguson, 2013; Jacoby & Ahissar, 2015; Lawrence et al., 2018; Simons et al., 2016). Assessing the scope of generalization could provide insight into the neural processes undergoing reorganization following training (Ahissar et al., 2009; Censor, 2013; Karni, 1996; Karni & Sagi, 1993) and bear clinical significance for developing effective training protocols (Henshaw et al., 2022). The current study sought to investigate the level of generalization achieved through two training protocols involving an auditory voice discrimination task, emphasizing critical acoustic cues for SIN recognition (Bronkhorst, 2015; Darwin et al., 2003; Vestergaard et al., 2009). One protocol operated in quiet settings, whereas the other mirrored the first but with added background noise, the latter potentially encompassing higher auditory or cognitive processing.

There is limited understanding regarding the circumstances facilitating the generalization of perceptual learning (Irvine, 2018). Various factors have been proposed to impact generalization, including the specific level of processing. This level refers to the depth or complexity of cognitive operations involved in transforming and manipulating incoming information targeted during training (e.g., lower-level auditory processing, where basic sensory information is encoded and analyzed, or higher-level cognitive processing, involving complex reasoning, decision-making, and memory retrieval, Henshaw et al., 2022; Lawrence et al., 2018; Van Wilderode et al., 2023). Other suggested factors include the extent of overlap in brain representations between the trained and untrained tasks (Amitay et al., 2014; Hesseg et al., 2016), the degree of variability within the training materials (Amitay et al., 2005; Banai & Amitay, 2012; Van Wilderode et al., 2023), the number of training sessions (Jeter et al., 2010; Korman et al., 2003; Zaltz, Goldsworthy et al., 2020), and the number of trials conducted within each session (Censor & Sagi, 2009; Molloy et al., 2012).

The accepted notion suggests that while there is no straightforward rule to predict the generalization pattern for a specific task (Wright & Zhang, 2009), generalization tends to occur when the neural changes induced by training extend beyond neurons activated solely by the training stimuli (Irvine, 2018). Accordingly, when generalization occurs, the presumption is that learning takes place in high-level neural processing, where neurons respond independently to various stimulus features. This leads to representations that respond similarly to stimuli differing in these features, allowing learning gains to transfer to untrained stimuli or tasks (Dudai et al., 2015; Pinsard et al., 2019). Conversely, if the learning gains remain specific to the trained conditions or tasks, it implies that neural changes induced by training occurred in lower-level processing and representation. This level is characterized by selectivity for fundamental input features (Ahissar & Hochstein, 1997, 2004; Ahissar et al., 2009).

Voice discrimination (VD) offers a unique task to investigate the extent of generalization following perceptual training, given its significance in enhancing speech perception in noisy environments (Bronkhorst, 2015). Among the voice cues for VD, the fundamental frequency (F0), influenced by vocal cord length, mass, and vibration rate, and formant frequencies shaped by vocal tract length, emerge as significant contributors (Başkent & Gaudrain, 2016; Darwin et al., 2003; Mackersie et al., 2011; Schvartz-Leyzac & Chatterjee, 2015; Skuk & Schweinberger, 2013; Vestergaard et al., 2009, 2011). Both cues convey substantial information about the speaker, including characteristics such as age, gender, and individual traits (Başkent & Gaudrain, 2016; Shultz, 2015; Skuk & Schweinberger, 2013; Smith et al., 2007; Smith & Patterson, 2005). Moreover, studies involving individuals with normal hearing demonstrate their strong reliance on F0 and formant frequencies for speaker discrimination (Koelewijn et al., 2021, 2023; Zaltz, Goldsworthy et al., 2020; Zaltz & Kishon-Rabin, 2022) and talker segregation (Başkent & Gaudrain, 2016; Darwin et al., 2003). Conversely, individuals facing difficulties in perceiving differences in F0 and formant frequencies, such as those with hearing impairments using cochlear implants, display reduced VD (El Boghdady et al., 2019; Gaudrain & Başkent, 2018; Zaltz et al., 2018). These limitations may potentially contribute to their struggles in listening amid noisy environments.

Research investigating the impact of auditory training on voice perception commonly utilized two approaches. Explicit approaches involve the intentional acquisition of voice information through tasks like voice identification, where listeners are asked to identify the speaker's identity based on specific voice characteristics. In contrast, implicit approaches entail an unintentional acquisition of voice information through voice familiarity training, where listeners are exposed to specific voices while focusing on linguistic content rather than voice characteristics. Both of these approaches provide evidence supporting the advantages of such training for SIN recognition (Kreitewolf et al., 2017; Nygaard et al., 1994; Yonan & Sommers, 2000). Notably, even a brief 10-min voice training session has been shown to enhance speech intelligibility (Holmes et al., 2021). However, there are some indications that the results of such training remain highly specific to the tasks that were trained (Biçer et al., 2023; Yonan & Sommers, 2000). For example, a two-session voice identification or voice familiarization training with sentence stimuli proved beneficial for identifying words within sentences but not for recognizing isolated words (Yonan & Sommers, 2000). The authors accounted for this specificity by proposing that sentence-based training directed attention toward prosodic and rhythmic features absent in isolated words. Specificity in learning gains after voice familiarization training in quiet was also reported recently by Biçer et al. (2023). In their study, the inclusion of a training protocol involving approximately 30 min of listening to an audiobook segment did not yield a significant impact on VD when the reference voice matched the trained (audiobook) voice. 
The authors posited that this learning specificity could be attributed to the divergence in linguistic materials between training, which involved meaningful sentences, and testing, which comprised isolated, meaningless consonant–vowel (CV) triplets. Nevertheless, a noteworthy observation was a substantial increase in pupil dilation during VD with untrained voices compared to trained voices using vocoder-degraded speech. This finding suggests a training-induced advantage for the familiar voice in terms of reduced listening effort.

The present study aimed to advance our understanding of the constraints that influence generalization following auditory training and offer new insights into how the training conditions affect generalization. To this end, two participant groups underwent a two-session VD training alternating between F0 or formant cues (interleaved training). One group received training in quiet conditions, while the other was trained in background noise, the latter requiring increased engagement of high-level auditory and/or cognitive processing (Eckert et al., 2016; Herrmann & Johnsrude, 2020). The extent of generalization of the learning gains was evaluated across several differences between the trained and untrained tasks: (a) Acoustic cues for VD: comparing a single cue (F0/formants) versus the combined use of both, (b) Stimuli: contrasting sentences with words, and (c) Training conditions: distinguishing between quiet conditions, which primarily engage low-level processing, and background noise which additionally involves higher-level auditory and/or cognitive processing. Performance in the generalization tasks was assessed by comparing it to a control group that did not undergo any training. The hypothesis posited that the interleaved training would enhance VD based on a single acoustic cue, drawing from recent findings in interleaved interaural time difference (ITD) and interaural level difference (ILD) discrimination training (Ning & Wright, 2023). However, differences were expected in the extent of generalization of the learning gains across the various tasks and conditions, highlighting the linkage between the specificity of learning and the level of processing targeted during training (Henshaw et al., 2022; Lawrence et al., 2018; Van Wilderode et al., 2023).

Materials and Methods

Participants

Fifty-five young adults (28 females and 27 males), aged 18–33 years, were recruited for this study. The participants were assigned using a semirandomized method to one of three groups: (a) The "Trained-in-Quiet" group (mean age of 24.53 years ± 3.72, n = 15); (b) The "Trained-in-Noise" group (mean age of 24.61 years ± 3.89, n = 20); and (c) The "Control" group (mean age of 24.15 years ± 3.55, n = 20). Initially, participants were assigned to one of the three groups in a sequential order (group 1–2–3–1–2…). However, due to challenges in participant recruitment and the primary focus of the study on the effect of noise on VD learning and generalization, after reaching 15 participants in each group, randomization continued exclusively between the Trained-in-Noise and Control groups.

Sample size justification: based on previous studies, the group × learning effects (described below) were assumed to be large (f = .76). A power analysis was carried out using G*Power 3.1.9.2 (Faul et al., 2007). The minimum total sample size required to detect such an effect for an ANOVA mixed design with three groups and two measurements, assuming a .01 probability of Type I error and a power of .95, was 15. Therefore, the current sample, consisting of a total of 55 participants, was adequately powered.

All participants met specific inclusion criteria, including the following: (a) bilateral pure-tone air-conduction thresholds at octave frequencies of 500–4,000 Hz were ≤20 dB hearing level (ANSI, 2018); (b) participants were native Hebrew speakers; (c) they had completed at least 12 years of education; (d) there was no history of language or learning disorders; (e) no known attention deficit disorders were present; (f) participants had less than 1 year of musical training; and (g) participants had no prior experience with psychoacoustic testing. Background information related to criteria (a)–(g) was obtained through self-reported data gathered via a comprehensive questionnaire. Participants were not compensated for their participation in the study. Informed consent was obtained from all participants.

Speech Recognition Thresholds in Noise (SRTn)

To determine the signal-to-noise ratio (SNR) for the group who VD-trained in noise and evaluate the effect of VD-training on SIN recognition across participant groups, speech recognition thresholds in noise (SRTn) were measured for all participants utilizing a sentence-in-noise test (previously described in Levin et al., 2022; Levin & Zaltz, 2023; Zaltz, 2023).

Stimuli for the SRTn Test

For the present study, a set of 96 sentences specifically developed for this purpose was employed, each consisting of three disyllabic words in Hebrew adhering to a fixed grammatical structure (noun, verb, and adjective). A preliminary assessment confirmed that these sentences utilized a straightforward vocabulary suitable for young native-speaking adults. The sentences were recorded by a female native speaker in a soundproof room, using an AT-892-TH microphone and Sound-Forge software (Version 7.0), with stereo channels at a sampling rate of 44,100 Hz and a 16-bit quantization level. To maintain uniform intensity levels across the stimuli, amplitudes were normalized to −16 dB RMS. Each SRTn assessment included approximately 20–25 sentences selected from the pool of 96 sentences. The presentation order of the sentences was pseudorandomized, ensuring that each sentence was presented only once during an assessment. The background noise utilized was a continuous, speech-shaped noise (SSN) with a long-term spectrum that matched the long-term spectrum of the speech material.

SRTn Assessment

The sentences were presented at a constant level, while the noise levels were adjusted using a two-down, one-up adaptive process to determine the SNR corresponding to 70.7% sentence recognition on the psychometric function (Levitt, 1971). The tested SNR range spanned from +3 to −12 dB. Initially, sentences were presented at an SNR of +3 dB. Listeners were instructed to verbally repeat what they heard as accurately as possible, and they were encouraged to make educated guesses if unsure. There was no time limit for their responses. Based on the listener's responses, the tester marked the correctly repeated words on the computer, and subsequently the level of the next sentence was adjusted. Correctly repeating two or three words within a sentence was considered as correct sentence recognition, whereas correctly repeating zero or only one word within a sentence was regarded as incorrect sentence recognition. The initial SNR step was calculated as the difference between the easiest and most challenging SNRs, resulting in a 15 dB step (3 − (−12) = 15 dB). This step was halved following two consecutive correct responses or one incorrect response, down to the second reversal, reducing the step size to 7.5 dB, and then further to 3.75 dB. For instance, following two correct responses at the easiest SNR (+3 dB), the SNR was reduced from 3 dB to −4.5 dB (3 − 7.5 = −4.5). If the participant continued to give two correct responses, the SNR was further reduced from −4.5 dB to −8.25 dB (−4.5 − 3.75 = −8.25). However, if the participant provided one incorrect response, the SNR was increased from −4.5 dB to −0.75 dB (−4.5 + 3.75 = −0.75). For the next reversal, the SNR step was reduced by a factor of 1.41 (√2) to 2.66 dB, and for the subsequent reversals (n = 3), the SNR step was reduced by a factor of 1.19 (∜2) to 2.23 dB. The adaptive procedure concluded after six reversals, and the Speech Recognition Threshold in noise (SRTn) was computed as the mean of the last four reversals. 
A similar adaptive procedure was previously employed to evaluate SRTn in children, young adults with normal hearing, and cochlear implant users (Levin et al., 2022; Levin & Zaltz, 2023; Zaltz, 2023).
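The scoring rule and step-size arithmetic described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the software used in the study: the function names are hypothetical, and the step is assumed to shrink once per stage in the order given in the text.

```python
def sentence_is_correct(words_correct):
    """Scoring rule from the text: 2 or 3 of the 3 words repeated correctly."""
    return words_correct >= 2


def snr_step_schedule(easiest_snr=3.0, hardest_snr=-12.0):
    """Return the successive SNR step sizes in dB, as described in the text."""
    initial = easiest_snr - hardest_snr   # 3 - (-12) = 15 dB
    halved_once = initial / 2             # 7.5 dB
    halved_twice = halved_once / 2        # 3.75 dB
    # The text quotes rounded factors (1.41 for the square root of 2 and
    # 1.19 for the fourth root of 2); the same rounded values are used here.
    after_sqrt2 = halved_twice / 1.41
    after_fourth_root = after_sqrt2 / 1.19
    return [initial, halved_once, halved_twice, after_sqrt2, after_fourth_root]


def srtn_from_reversals(reversal_snrs):
    """SRTn as the mean of the last four of the six reversal SNRs."""
    if len(reversal_snrs) < 6:
        raise ValueError("the described procedure ends after six reversals")
    last_four = reversal_snrs[-4:]
    return sum(last_four) / len(last_four)


steps = snr_step_schedule()
print([round(s, 2) for s in steps])  # [15.0, 7.5, 3.75, 2.66, 2.23]
print(3.0 - steps[1])                # -4.5, the first worked example in the text
```

The reversal SNRs passed to `srtn_from_reversals` would come from the tester's record of a run; no such values are reported in the text, so none are assumed here.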

Voice Discrimination (VD) Test

Stimuli for the VD Test

The VD test employed in this study was previously detailed (Zaltz, 2023). This test comprises either monosyllabic consonant–vowel–consonant words from the Hebrew version of the Arthur Boothroyd (HAB) test (Kishon-Rabin et al., 2004) or sentences featuring three disyllabic words in Hebrew following a fixed grammatical structure (noun, verb, and adjective). All stimuli were recorded under identical conditions to those employed for the SRTn test to minimize acoustic differences between the tests. Specifically, the stimuli were recorded in the same room, by the same female speaker, using the same microphone, and at the same sampling rate and quantization level. The stimuli were modified within a 14-point stimulus continuum in either F0 alone, formant frequencies exclusively, or both F0 and formants, using the Pitch Synchronous Overlap-Add (PSOLA) algorithm (Moulines & Charpentier, 1990) implemented in the PRAAT software version 5.4.17 (copyright © 1992–2015 by Paul Boersma and David Weenink). This continuum exponentially ranged in √2 steps from a change of −0.127 semitone to a shift of −8 semitones, a technique similar to previous articles (Levin & Zaltz, 2023; Zaltz, 2023; Zaltz et al., 2018, 2020; Zaltz & Kishon-Rabin, 2022). Specifically, the mean F0 was altered by 0, −0.127, −0.18, −0.26, −0.36, −0.51, −0.72, −1.02, −1.44, −2.02, −2.86, −4.02, −5.67, and −8 semitones from the mean F0 of the original stimulus, using the PRAAT's Manipulation editor. For example, if a sentence had a mean F0 of 175.62 Hz, lowering the F0 resulted in the comparison sentence transitioning exponentially in √2 steps from 174.33 to 110.35 Hz. When manipulating the formant frequencies in this sentence, they were adjusted exponentially, ranging from a ratio of 0.99 (the smallest change from the original formant frequencies) to a ratio of 0.63 (the most significant change), using the PRAAT's Change Gender editor.
The latter adjustment necessitated resampling the stimulus to compress the frequency axis using a range of factors similar to the F0 change. Following this, the PSOLA algorithm was applied to restore the original pitch and duration. Combined F0 + formant changes were performed using first the Change Gender editor and then the Manipulation editor.
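The relation between the semitone shifts and the frequency ratios quoted above can be illustrated with a short Python sketch. It uses the standard equal-tempered conversion (a shift of s semitones scales frequency by 2^(s/12)); the function names are hypothetical, and the sketch says nothing about how PRAAT performs the manipulation internally.

```python
# The 14-point continuum of semitone shifts, copied from the text.
CONTINUUM_SEMITONES = [0, -0.127, -0.18, -0.26, -0.36, -0.51, -0.72,
                       -1.02, -1.44, -2.02, -2.86, -4.02, -5.67, -8]


def semitones_to_ratio(semitones):
    """Frequency ratio for a shift in semitones (12 semitones = one octave)."""
    return 2.0 ** (semitones / 12.0)


def shifted_frequency(frequency_hz, semitones):
    """Frequency in Hz after applying the semitone shift."""
    return frequency_hz * semitones_to_ratio(semitones)


if __name__ == "__main__":
    for shift in CONTINUUM_SEMITONES:
        ratio = semitones_to_ratio(shift)
        # 175.62 Hz is the example mean F0 given in the text.
        print(f"{shift:>7} st  ratio {ratio:.3f}  F0 {shifted_frequency(175.62, shift):.2f} Hz")
```

Under this conversion, the smallest and largest shifts correspond to ratios of about 0.99 and 0.63, in line with the formant ratios quoted for the Change Gender manipulation.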

VD Threshold Assessment

The three-interval three-alternative forced choice (3I3AFC) method was used to assess VD Just Noticeable Differences (JNDs) based on either F0, formants, or the combination of F0 and formant cues. Each trial involved the presentation of two reference (unprocessed) stimuli and one comparison (manipulated) stimulus, which were timed with a 300-ms interstimulus interval. A corresponding square on the PC monitor was highlighted when a stimulus was presented. Participants were instructed to select the stimulus they perceived as "sounding different" by clicking on the respective square using a computer mouse. Because these instructions focused on voice characteristics rather than linguistic content, the task was categorized as an explicit voice perception task. The initial comparison stimulus featured the most significant manipulation, with F0 lowered by eight semitones and formants adjusted by a ratio of 0.63. An adaptive tracking procedure following a two-down, one-up pattern was employed to determine difference limens (DLs) that corresponded to a 70.7% detection threshold on the psychometric function (Levitt, 1971). The difference between the reference and comparison stimuli was successively halved until the first reversal, and subsequently, it was adjusted using a √2 factor until the sixth reversal. This process resulted in an average of approximately 25 trials for each VD assessment, totaling around 75 stimuli. JNDs were calculated as the average of the last four reversals, as documented in previous studies (Levin & Zaltz, 2023; Zaltz, 2023; Zaltz et al., 2018, 2020; Zaltz & Kishon-Rabin, 2022). Participants received no feedback during the task, and there was no time restriction for making their selections.
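The adaptive track described above can be sketched as follows. This is an illustrative reconstruction, not the study's test software. It assumes that, because the continuum is spaced in √2 steps, "halving" the difference means moving two points along the continuum and a "√2 factor" means moving one point; it also assumes the JND is the arithmetic mean of the last four reversal values. The function names are hypothetical.

```python
# Non-zero points of the continuum (absolute shift in semitones), from the text.
LEVELS = [0.127, 0.18, 0.26, 0.36, 0.51, 0.72, 1.02,
          1.44, 2.02, 2.86, 4.02, 5.67, 8.0]


def run_staircase(responses, levels=LEVELS, max_reversals=6):
    """Two-down, one-up track; returns the levels at which reversals occurred.

    Two consecutive correct responses make the task harder and one incorrect
    response makes it easier, so the track converges on the level where
    p(correct)**2 = 0.5, i.e. p = 0.707 (Levitt, 1971).
    """
    index = len(levels) - 1          # start at the largest difference
    streak = 0                       # consecutive correct responses
    last_direction = 0               # -1 = harder, +1 = easier, 0 = no move yet
    reversals = []
    for correct in responses:
        if correct:
            streak += 1
            if streak < 2:
                continue             # wait for the second correct response
            direction = -1
        else:
            direction = +1
        streak = 0
        if last_direction and direction != last_direction:
            reversals.append(levels[index])
            if len(reversals) == max_reversals:
                break
        step = 2 if not reversals else 1   # assumed mapping, see lead-in
        index = min(max(index + direction * step, 0), len(levels) - 1)
        last_direction = direction
    return reversals


def jnd_from_reversals(reversals):
    """Mean of the last four reversal values (arithmetic mean assumed)."""
    if len(reversals) < 4:
        raise ValueError("need at least four reversals")
    return sum(reversals[-4:]) / 4


# A made-up response sequence (True = correct), for illustration only.
demo = [True] * 6 + [False, True, True, False, True, True, False, True, True]
print(jnd_from_reversals(run_staircase(demo)))
```

The demonstration sequence is invented and far shorter than the roughly 25 trials per assessment reported above; it only shows how reversals accumulate and how the threshold is derived from them.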

For the group who trained in noise, VD training was carried out using the SSN employed in the SRTn test. This noise was presented at an SNR set approximately 5 dB above the individual SRTn, denoted as SRTn +5 dB. It was assumed that this SNR corresponded to over 90% speech recognition based on recent findings that established speech recognition functions for young adults with normal hearing (Sobon et al., 2019).

Study Design

Assessments

All participants underwent two assessments in two separate sessions, spaced 1–3 days apart (Figure 1). The first (baseline) assessment included a short hearing test to ensure air-conduction thresholds within the normal range bilaterally at octave frequencies of 500–4,000 Hz (ANSI, 2018). Following the hearing test, participants underwent two SRTn evaluations and four VD evaluations in quiet conditions, with combined F0 + formant cues: two with word stimuli and two with sentence stimuli. The order of tests and stimuli was balanced across participants. The mean pure-tone air-conduction thresholds at 500 Hz, 1,000 Hz, and 2,000 Hz (PTA) were individually calculated for each ear, serving as a baseline to determine the presentation level for the SRTn and VD tasks.

Figure 1.

Study design. Participants are divided into three groups—Trained-in-Quiet (n = 15), Trained-in-Noise (n = 20), and Control (n = 20). Two sessions were conducted 1–3 days apart. First session: baseline (first) assessment for all groups and interleaved training phase for trained groups only. Second session: interleaved training phase for trained groups and second assessment for all three groups. Trained-in-Quiet group, trained in quiet; Trained-in-Noise group, trained amidst background noise. VD = voice discrimination, F0 = fundamental frequency, SRTn = speech recognition thresholds in noise.

The second assessment comprised two SRTn evaluations and four VD evaluations based on combined F0 + formant cues, mirroring the tasks from the first assessment.

Training

Two interleaved training phases were administered only to the trained groups. The first phase occurred immediately after the initial assessment during the first session. Subsequently, the second phase took place before the second assessment during the second session (Figure 1). Each training phase encompassed six VD evaluations utilizing sentence stimuli. These evaluations alternated between three VD tests employing F0 cues and three utilizing formant cues, following these sequences: F0–formants–F0–formants–F0–formants, F0–F0–formants–F0–F0–formants, formants–F0–formants–F0–formants–F0, and formants–formants–F0–formants–formants–F0, evenly distributed among participants. Training took place in a quiet setting for the Trained-in-Quiet group, while the Trained-in-Noise group underwent training under background noise conditions as detailed above.

Each session lasted approximately 85–100 min for the trained groups and 45–55 min for the control group, including one to three short breaks, as needed. The study received approval from the Institutional Review Board of the University.

Apparatus

The stimuli were delivered using the internal soundcard of a laptop personal computer through a GSI-61 audiometer to both ears via TDH-49 headphones. The stimuli were presented at approximately 35–40 dB sensation level (above individual PTA of each ear), to balance sensation level across participants. The testing and training sessions were conducted in a sound-treated, single-walled room. The experimenter sat alongside the participants, closely monitoring and recording their verbal responses.

Data Analysis

Statistical analyses were carried out using SPSS-20 software (IBM Corp, Armonk, NY). A single participant from the control group had VD data deviating by over 2.5 standard deviations (SDs) from the group mean and was consequently excluded from the data analysis. Before performing the analyses of variance (ANOVAs), normality checks were conducted on the residuals of all dependent variables using one-sample Kolmogorov–Smirnov and Shapiro–Wilk tests (for a total sample size of N = 54). The results of these tests, along with a visual inspection of the residuals using Normal Q–Q plots, indicated that the residuals for the SRTn variables exhibited a normal distribution (p-values > .05). However, several residuals for the VD variables did not meet the normality assumption, specifically VD with word stimuli in the second session for both the control group and the Trained-in-Noise group, and VD with sentence stimuli in the second session for the Trained-in-Noise group. Consequently, all VD data underwent logarithmic transformation before being entered into the ANOVAs, following the approach used in previous studies (El Boghdady et al., 2018; Koelewijn et al., 2021, 2023). Subsequent analysis revealed that the residuals of the log-transformed VD data conformed to a normal distribution (p-values > .05), ensuring that all the data adhered to the assumptions required for ANOVA.
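The exclusion criterion and the transformation described above can be sketched in a few lines of Python. This is only an illustration: the analyses were run in SPSS, the text does not state the base of the logarithm (base 10 is assumed here), and the function names and demonstration values are hypothetical.

```python
import math
import statistics


def flag_outliers(values, threshold_sd=2.5):
    """Mark values lying more than `threshold_sd` sample SDs from the group mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [abs(v - mean) > threshold_sd * sd for v in values]


def log_transform(thresholds):
    """Log-transform positive thresholds (base 10 is an assumption)."""
    return [math.log10(t) for t in thresholds]


# Invented thresholds in semitones, for illustration only.
group = [1.0] * 18 + [10.0]
flags = flag_outliers(group)
kept = [v for v, is_outlier in zip(group, flags) if not is_outlier]
print(sum(flags), "value(s) flagged;", len(kept), "kept")
print(log_transform([1.0, 10.0]))
```

Checking the normality of residuals, as the authors did with Kolmogorov–Smirnov and Shapiro–Wilk tests, would require a statistics package and is not reproduced here.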

To assess learning with the trained task, repeated measures ANOVA (RM-ANOVA) was conducted on the VD thresholds obtained during the training phase. Acoustic cue (F0 only, formants only), session (1, 2), and block (1–3) were considered as within-subject variables, while training conditions (quiet, noise) served as the between-subject variable. To investigate the generalization of learning to untrained tasks compared to the control group, two additional RM-ANOVAs were performed on data collected during the two assessments. The first RM-ANOVA was performed on the average of the two VD blocks associated with each stimulus. The within-subject variables were stimulus (words, sentences) and session (1, 2) and the between-subject variable was group (Trained-in-Quiet, Trained-in-Noise, control). The second RM-ANOVA focused on the average of the two SRTn blocks, with session (1, 2) as the within-subject variable and group (Trained-in-Quiet, Trained-in-Noise, control) as the between-subject variable. All post hoc analyses were conducted using the Šidák correction for multiple comparisons.

Results

Learning

VD thresholds for the two trained groups during the training phase are detailed in Table 1. Statistical analysis revealed a significant effect for voice cue [F(1, 33) = 10.907, p = .002, ƞ² = .254], indicating better thresholds for formant compared to F0 cues. In addition, a significant effect of the session [F(1, 33) = 13.939, p = .001, ƞ² = .303] was observed, with better thresholds in the second session than in the first session, indicating significant learning. No significant effect was observed for the block [F(2, 33) = .177, p = .823]. Figure 2(a) illustrates VD performance for each voice cue in each session. Note that due to the absence of a significant block effect, the performance in each session is depicted as the average of the three blocks within that session. Although a visual inspection suggests better performance in quiet than in noise, the effect of training conditions was not found to be significant [F(1, 33) = 2.924, p = .068]. Furthermore, no significant interactions were detected (p > .05).

Figure 2.

Average voice discrimination thresholds (±SE). (a) Training session data (mean of blocks 1–3 per session with each voice cue) for both training groups. (b) Assessment data (mean of two blocks per session with each task) for the training and control groups. F0 = fundamental frequency, SE = standard error.

Table 1.

Average Voice Discrimination Thresholds (in Semitones) in the Three Blocks Conducted During Each of the Two Training Sessions With Each Voice Cue for Both Training Groups (in Parentheses Are the Standard Deviations).

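Acoustic cue	Condition	1st training session, block 1	1st training session, block 2	1st training session, block 3	2nd training session, block 1	2nd training session, block 2	2nd training session, block 3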
F0	Quiet	1.30 (0.13)	1.39 (0.17)	1.50 (0.21)	1.28 (0.21)	1.17 (0.16)	0.92 (0.10)
F0	Noise	1.76 (0.24)	1.30 (0.15)	1.61 (0.16)	1.23 (0.15)	1.43 (0.19)	1.44 (0.16)
Formants	Quiet	0.98 (0.16)	1.05 (0.11)	1.00 (0.14)	0.97 (0.13)	0.90 (0.14)	0.84 (0.11)
Formants	Noise	1.26 (0.13)	1.12 (0.10)	1.15 (0.08)	1.02 (0.13)	1.03 (0.15)	0.97 (0.11)

Generalization

VD thresholds for the two trained groups and the control group during the first and second assessments are outlined in Table 2(a). Figure 2(b) shows the mean of the two blocks for each stimulus (words/sentences) in each assessment. Statistical analysis revealed a significant effect for session [F(1, 51) = 32.175, p < .001, ƞ² = .387], with better thresholds in the second session than the first session, signifying significant learning. No significant effects were found for group [F(2, 51) = .076, p = .927] or stimulus [F(1, 51) = 1.689, p = .200], and no significant interactions were detected (p > .05).

Table 2.

(a) Average Voice Discrimination Thresholds (in Semitones) in the Two Blocks Conducted During Each of the Two Assessment Sessions With the Word and Sentence Stimuli, Across the Three Study Groups. (b) Average SRTn (in dB) in the Two Blocks Conducted During Each of the Two Assessment Sessions, Across the Three Study Groups (in Parentheses are the Standard Deviations).

(a)
Stimuli	Group	1st assessment, block 1	1st assessment, block 2	2nd assessment, block 1	2nd assessment, block 2
Words	Trained-in-Quiet	1.13 (0.17)	0.95 (0.11)	0.71 (0.12)	0.63 (0.11)
Words	Trained-in-Noise	0.98 (0.12)	0.95 (0.12)	0.67 (0.11)	0.77 (0.15)
Words	Control	1.02 (0.15)	0.80 (0.10)	0.80 (0.14)	0.78 (0.12)
Sentences	Trained-in-Quiet	0.77 (0.08)	0.67 (0.09)	0.63 (0.08)	0.57 (0.11)
Sentences	Trained-in-Noise	1.02 (0.13)	0.84 (0.15)	0.58 (0.12)	0.64 (0.13)
Sentences	Control	0.72 (0.08)	0.80 (0.09)	0.65 (0.09)	0.74 (0.13)

(b)
Group	1st assessment, block 1	1st assessment, block 2	2nd assessment, block 1	2nd assessment, block 2
Trained-in-Quiet	−4.24 (0.41)	−5.22 (0.26)	−5.61 (0.49)	−6.16 (0.45)
Trained-in-Noise	−3.79 (0.26)	−4.43 (0.27)	−6.28 (0.37)	−7.01 (0.37)
Control	−3.07 (0.26)	−4.32 (0.35)	−5.00 (0.25)	−5.97 (0.27)

SRTn for the two trained groups and the control group during the two assessments are detailed in Table 2(b). Statistical analysis revealed significant effects for group [F(2, 51) = 6.597, p = .003, ƞ² = .206] and session [F(1, 51) = 79.444, p = .001, ƞ² = .609], with a significant group × session interaction [F(2, 51) = 4.443, p = .017, ƞ² = .148]. Follow-up ANOVAs explored this interaction, with each analysis comparing two specific groups. Results revealed significant session × group interactions for the Trained-in-Noise group versus Trained-in-Quiet and the Trained-in-Noise group versus Control [F(1, 33) = 7.940, p = .008, ƞ² = .194; F(1, 37) = 5.236, p = .028, ƞ² = .124, respectively]. These interactions indicated that the Trained-in-Noise group exhibited a steeper slope between sessions, reflecting greater improvement in SRTn compared to both the Trained-in-Quiet and Control groups (as illustrated in Figure 3). Notably, there was no significant interaction observed for the Trained-in-Quiet group versus Control [F(1, 32) = 1.142, p = .293], suggesting a similar magnitude of improvement between these two groups across sessions.
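The effect sizes in the preceding paragraph can be related to the F statistics with a short calculation. The sketch below assumes that the reported ƞ² values are partial eta squared, as SPSS reports by default for these designs; the text does not state this explicitly, and the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F statistics and degrees of freedom copied from the SRTn analysis above.
srtn_effects = {
    "group": (6.597, 2, 51),
    "session": (79.444, 1, 51),
    "group x session": (4.443, 2, 51),
    "Trained-in-Noise vs Trained-in-Quiet": (7.940, 1, 33),
    "Trained-in-Noise vs Control": (5.236, 1, 37),
}

for name, (f_value, df_effect, df_error) in srtn_effects.items():
    print(f"{name}: {partial_eta_squared(f_value, df_effect, df_error):.3f}")
```

Under that assumption, the computed values reproduce the effect sizes reported for the SRTn analysis (.206, .609, .148, .194, and .124).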

Figure 3.

Average SRTn (±SE). Mean of two blocks per session for the training and control groups. SRTn = speech recognition thresholds in noise, SE = standard error.
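The between-session change summarized in the figure can be approximated from the group-level block means in Table 2(b). The sketch below averages the two blocks of each assessment and subtracts the second-assessment mean from the first. It is illustrative only: the reported statistics were computed on individual data, and the names used here are hypothetical.

```python
# Group-level block means of SRTn (dB) copied from Table 2(b):
# (block 1, block 2) for each assessment.
SRTN_DB = {
    "Trained-in-Quiet": {"first": (-4.24, -5.22), "second": (-5.61, -6.16)},
    "Trained-in-Noise": {"first": (-3.79, -4.43), "second": (-6.28, -7.01)},
    "Control": {"first": (-3.07, -4.32), "second": (-5.00, -5.97)},
}


def session_mean(blocks):
    """Mean of the two blocks run within one assessment."""
    return sum(blocks) / len(blocks)


def improvement_db(group):
    """First-assessment mean minus second-assessment mean.

    A positive value means a lower (better) SRTn at the second assessment.
    """
    sessions = SRTN_DB[group]
    return session_mean(sessions["first"]) - session_mean(sessions["second"])


for group in SRTN_DB:
    print(f"{group}: improvement of {improvement_db(group):.2f} dB")
```

With these table values, the Trained-in-Noise group shows the largest change (about 2.5 dB), compared with about 1.2 dB for the Trained-in-Quiet group and about 1.8 dB for the Control group.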

Discussion

The primary aim of the current study was to investigate the impact of two training sessions with VD on the generalization of the learning gains to various tasks, differing in acoustic, linguistic, and cognitive demands from the trained task. This approach sought to shed light on the efficiency and limitations of auditory learning mechanisms in young adults with normal hearing. The key findings of the study are as follows: (a) Two sessions of interleaved VD training focused on a single voice cue (either F0 or formants) resulted in significant learning improvements, irrespective of the training conditions (quiet or noise). (b) Rapid learning was evident in VD utilizing combined F0 + formant cues between the first and second assessments. However, the learning gains for this task did not differ significantly between the training groups and the untrained control group. This lack of differentiation suggests no added advantage from single-cue training for combined-cue VD, indicating limited generalization. (c) Rapid learning was also observed in the SRTn task for both trained and control groups. However, the most significant enhancements were seen in the group that underwent VD training in background noise, suggesting increased generalization between tasks with challenging acoustic conditions.

The observed significant improvements in VD performance between the first and second training sessions, relying on either F0 or formant cues, align with numerous studies that have noted enhancements in pure-tone and F0 discrimination following training (Delhommeau et al., 2005; Lau et al., 2017; Menning et al., 2000; Micheyl et al., 2006; Roth et al., 2008; Wright & Sabin, 2007; Zaltz et al., 2018, 2020). These findings are also consistent with research showing significant learning gains following voice training that involved different tasks, such as voice identification or familiarization (Kreitewolf et al., 2017; Nygaard & Pisoni, 1998; Yonan & Sommers, 2000).

Interestingly, in the current study, significant improvements in VD between the initial and subsequent assessments were also observed for the combined F0 + formant task across both the trained and control groups, indicating rapid learning or adaptation. Rapid learning has been previously noted in studies focusing on speech perception with various forms of degraded speech, such as noise-vocoded, accented, smeared, or time-compressed speech (Banai et al., 2022; Borrie et al., 2012; Davis et al., 2005; Gordon-Salant et al., 2010). Moreover, research has shown that as little as 10 min of voice training can lead to improvements in speech intelligibility (Holmes et al., 2021). Previous research has attributed such fast improvements to top-down tuning and adaptation processes that facilitate the formation of effective task-solving strategies and/or reduce response bias (Hauptmann et al., 2005; Hauptmann & Karni, 2002). A recent study further suggests that this rapid learning may be stimulus-specific, representing an essential initial stage of perceptual learning (Banai et al., 2022). Importantly, that study also demonstrated that rapid learning of time-compressed speech is associated with improved speech understanding in adverse listening conditions, enabling listeners to quickly adapt to continuously changing acoustic environments. The current findings of rapid improvements in VD therefore support the idea that brief exposure to tasks targeting voice characteristics may enhance speech perception in noisy environments. Future studies could explore this potential further.

It is important to note that the training approach in the present study differed from previous approaches in that it incorporated two distinct acoustic cues, F0 and formant frequencies, in an interleaved manner. The rationale behind this approach stemmed from prior findings suggesting that increased variability in training materials could enhance generalization (Amitay et al., 2005; Banai & Amitay, 2012; Van Wilderode et al., 2023). Moreover, alternating the training between these voice cues did not appear to impede learning, as comparable improvements in VD were evident using either cue. This finding aligns with a recent study demonstrating learning for interleaved training between ITD and ILD discrimination tasks (Ning & Wright, 2023). However, it contradicts an earlier study reporting that alternating training between two auditory temporal interval discrimination tasks with different standards disrupted learning for both tasks (Banai et al., 2010). This suggests that the susceptibility of learning on a particular task to interference from training on another task may depend on the specific learning stage at which the second task is introduced and on the specific combination of tasks.

Despite incorporating some variability in the training material regarding voice cues, VD training with a single cue did not significantly enhance VD with both cues beyond the improvements observed in the control group, which were attributed to rapid learning mechanisms. This remained consistent for stimuli resembling the trained ones—three-word sentences—and for dissimilar stimuli like monosyllabic words. This outcome was rather surprising, especially considering that the VD training in the present study included voice familiarity training, utilizing the same voice for both training and testing. Numerous studies suggest that, in the realm of speech, acoustic similarity plays an important role in the generalization of learning (Bradlow et al., 2023; Strori et al., 2020; Tzeng et al., 2024). One possible explanation for this unexpected outcome may be linked to the distinct processing mechanisms involved with each acoustic cue for VD. Notably, formant coding predominantly engages spectral processing, while F0 coding relies on both spectral and temporal processing (Carlyon & Shackleton, 1994; Fant, 1960; Fu et al., 2004; Lieberman & Blumstein, 1988; Oxenham, 2008; Xu & Pfingst, 2008). Therefore, in a VD task utilizing both formant and F0 cues, a synthesis of these coding mechanisms is required. Thus, although this task may benefit from cue redundancy, as evidenced by the overall superior performance with combined cues compared to separate cues, it also necessitates higher-level integration beyond simple sensory processing. The limited generalization to combined cues beyond initial adaptation may suggest that the neural changes induced by single-cue training were rooted, to some extent, in lower levels of sensory processing. 
This rationale aligns with the notion that during training neural changes progressively shift from higher to lower processing levels, marked by specificity for fundamental features, thereby limiting generalization (Ahissar & Hochstein, 1997, 2004; Ahissar et al., 2009; Karni, 1996; Karni & Sagi, 1991). It is also supported by studies involving voice familiarization or identification training, which indicated a lack of generalization between training using meaningful sentences and subsequent testing employing isolated meaningful words (Yonan & Sommers, 2000) or meaningless CV-triplets (Biçer et al., 2023).

Notably, rapid learning was also evident in the SRTn task, across groups. However, while one might have initially speculated that both the Trained-in-Noise and Trained-in-Quiet groups would demonstrate significant improvements in SRTn compared to the control group, our findings revealed a more nuanced result. Specifically, only the Trained-in-Noise group, whose training conditions were identical to those utilized in the SRTn test, exhibited notable enhancements in SRTn compared to controls. Furthermore, this group surpassed the Trained-in-Quiet group in terms of SRTn improvements. This outcome suggests that SRTn performance was primarily influenced by the training conditions (noise) rather than the VD task or voice familiarization, which were shared by both the Trained-in-Quiet and Trained-in-Noise groups. A recent study revealed that only about 25%–55% of the variability in VD performance in the presence of SSN with a similar favorable SNR could be explained by VD performance levels observed in quiet environments among both children and adults (Kishon-Rabin & Zaltz, 2023). This suggests the engagement of additional processing mechanisms specifically utilized for VD in noisy conditions, beyond those utilized in quiet conditions. The Model of Listening Engagement proposes that as the listening environment becomes more intricate, for example, with added background noise, the demand on listening effort intensifies, necessitating greater involvement of cognitive functions, including working memory, inhibition, and selective attention (Herrmann & Johnsrude, 2020). Furthermore, neuroimaging studies that investigated speech recognition in challenging listening situations (such as degraded speech or background noise) consistently reveal the involvement of nonauditory neural systems. These systems support attention, performance monitoring, and optimization across various tasks (for a review see: Eckert et al., 2016).
The superior SIN recognition demonstrated by the Trained-in-Noise group after training may therefore reflect improvements due to training in both low-level sensory processing of voice cues and higher-level nonauditory (cognitive) mechanisms such as focused attention on the relevant task while inhibiting background noise, particularly within this group. While the improvements at the low level remained specific to the trained task (without additional enhancement in the combined-VD task, as previously discussed), the adjustments in higher-level processing generalized, resulting in superior SIN recognition. This explanation aligns with the concept that the generalization of learning gains occurs when neural changes induced by training affect higher-level processing, where neurons respond similarly to stimuli varying in multiple features (Dudai et al., 2015; Pinsard et al., 2019). It is also in accordance with a recently introduced training approach that integrates both auditory and cognitive challenges to foster broad generalization across various auditory scenarios (Van Wilderode et al., 2023).

An alternative explanation regarding the transfer of learning improvements from the VD-in-noise task to SIN recognition proposes that the noise-based training enhanced either peripheral or central auditory processing of the speech signal, extending beyond the specific VD task. In terms of peripheral auditory processing, prior findings highlight the detrimental impact of background noise on both F0 and formant perception. Noise has been demonstrated to mask spectral information in speech signals (Brungart, 2001; Ezzatian et al., 2012), leading to poor identification of formant peaks within the speech signal's spectral envelope (Liu & Kewley-Port, 2004; Stelmachowicz et al., 1990; Swanepoel et al., 2012) and impeding VD based on formant cues (Kishon-Rabin & Zaltz, 2023). Additionally, it has been suggested to hinder the ability to exploit periodicity cues (Steinmetzger & Rosen, 2015) and decrease subcortical neural synchrony, potentially affecting phase-locking mechanisms (Dimitrijevic et al., 2013; Han et al., 2020). These effects might, in turn, impede F0 perception (Carlyon & Shackleton, 1994; Oxenham, 2008). VD training amidst noise may have improved these spectral and temporal processing mechanisms, thereby boosting VD and speech recognition, both highly dependent on F0 and formant perception. Although this rationale appears plausible, it is important to emphasize that in the current study, VD was trained at an SNR that was 5 dB higher (easier) than the individual SNR associated with 70.7% sentence recognition on the psychometric function, ensuring clear audibility of the speech signal (Abdel-Latif & Meister, 2021; Sobon et al., 2019). This approach aimed to minimize the possibility of energetic masking, thereby mitigating potential information loss due to noise at the peripheral level (Brungart, 2001).

The proposition that VD training in noise might have enhanced central auditory processing of the speech signal is supported by studies investigating SIN using auditory evoked potentials. These studies reveal that adding background noise to a speech signal typically reduces the amplitudes and prolongs the peak latencies of N1, P2, and the mismatch negativity (MMN). This indicates that noise negatively impacts preattentive auditory processing of the speech signal, affecting sensory representation, classification, and discrimination (Gustafson et al., 2019; Kaplan-Neeman et al., 2006; Kozou et al., 2005; Martin et al., 1999; Muller-Gass et al., 2001). Furthermore, several studies detected notable changes in evoked potential measures, including MMN latency, duration, and area, after training involving SIN recognition tasks (Ceyhan et al., 2022; Kraus et al., 1995). Future research could test the proposition that VD training in a noisy environment enhances central auditory processing of the speech signal by combining behavioral and physiological assessments.

Limitations and Suggestions for Future Studies

In the current study, participants were assigned to one of three study groups using a semirandomized procedure. This assignment method, which resulted in uneven group sizes, may have introduced bias and acted as a confounding factor, leading to differences in participants’ baseline (naïve) performance across the groups. Specifically, the Trained-in-Quiet group demonstrated superior initial SRTn performance compared to the other two groups and may therefore have had less to gain from the training. This limitation was partially addressed by focusing on improvements (slope of the difference between the two sessions) rather than absolute performance values. Indeed, despite initial performance differences, there was no significant difference in the magnitude of SRTn improvement between the Trained-in-Quiet and Control groups. Only the Trained-in-Noise group demonstrated greater improvement compared to both groups. However, researchers aiming to replicate this work may consider allocating participants to groups of equal size based on their initial performance on the task. This adjustment could further enhance the clarity of the observed training effects and strengthen the comparability of outcomes across study groups. Additionally, VD training in the current study involved multiple JND assessments per session, with each assessment comprising a fixed number of reversals (six). The total number of stimuli, however, was not predetermined, which allowed for a more precise determination of individual JNDs. While this flexibility may have introduced some variability in the results, future studies could consider implementing training protocols with a fixed number of stimuli to enhance consistency and comparability across participants. Finally, the SRTn assessment in the present study closely aligned with the conditions used for training the Trained-in-Noise group on VD, utilizing the same speaker and background noise.
This methodological alignment enabled us to explore potential generalization across tasks that share similar acoustic environments. However, drawing broader conclusions about generalization to real-world speech perception scenarios requires further investigation and consideration.
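To illustrate why a track that stops after a fixed number of reversals does not have a fixed number of trials, the following minimal sketch simulates an adaptive two-down/one-up track, the transformed up-down rule that converges on 70.7% correct (Levitt, 1971), ending after six reversals. Only the six-reversal stopping rule is taken from the description above. The two-down/one-up rule for the VD track, the step size, the starting level, the averaging of reversal levels, and the simulated listener are all assumptions made for illustration, and every name is hypothetical.

```python
import math
import random


def two_down_one_up(respond, start, step, n_reversals=6):
    """Run a two-down/one-up adaptive track until n_reversals reversals occur.

    respond(level) -> bool reports whether the trial at `level` was correct.
    Returns (threshold_estimate, n_trials, reversal_levels), where the estimate
    is the mean of the reversal levels (an assumed averaging rule).
    """
    level = start
    correct_in_row = 0
    direction = 0  # -1 = track moving down (harder), +1 = moving up (easier)
    reversals = []
    n_trials = 0
    while len(reversals) < n_reversals:
        n_trials += 1
        if respond(level):
            correct_in_row += 1
            if correct_in_row < 2:
                continue  # two consecutive correct responses are needed to step down
            correct_in_row = 0
            new_direction = -1
        else:
            correct_in_row = 0
            new_direction = +1
        if direction != 0 and new_direction != direction:
            reversals.append(level)  # the track changed direction at this level
        direction = new_direction
        level = max(level + new_direction * step, step)  # keep the difference positive
    return sum(reversals) / len(reversals), n_trials, reversals


def make_listener(threshold, rng, chance=0.5, slope=4.0):
    """Hypothetical listener whose P(correct) rises from `chance` toward 1."""
    def respond(level):
        p = chance + (1 - chance) / (1 + math.exp(-slope * (level - threshold)))
        return rng.random() < p
    return respond


for seed in (1, 2, 3):
    listener = make_listener(threshold=1.0, rng=random.Random(seed))
    estimate, n_trials, _ = two_down_one_up(listener, start=3.0, step=0.5)
    print(f"seed {seed}: estimate {estimate:.2f} semitones after {n_trials} trials")
```

Because the stopping rule counts reversals rather than trials, simulated listeners with the same underlying threshold can complete different numbers of trials, which is the source of variability in stimulus exposure noted above.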

Conclusions

The primary objective of the present study was to explore the impact of training conditions on the generalization of learning gains. The findings indicate that VD training in either quiet or background noise conditions can rapidly enhance performance. However, the generalization of learning gains following training in quiet conditions appears to be limited, suggesting that resultant neural changes may predominantly occur at a low processing level. Conversely, VD training conducted in more challenging listening conditions, such as background noise, has the potential to broaden transferability to tasks involving greater auditory or central challenges in the listening process, such as SIN recognition. This potential generalization may occur by prompting higher levels of auditory or cognitive processing. While the study exclusively examined young adults with fully developed auditory and cognitive systems, these novel insights underscore the advantages of employing auditory training tasks that engage multiple processing levels to broaden the generalization of the learning gains, offering practical benefits. Future investigations encompassing younger children, older adults with normal hearing, and individuals with auditory or cognitive impairments are necessary to evaluate whether similar patterns of generalization occur across different developmental stages and pathological conditions.

Footnotes

Acknowledgments

The author expresses gratitude to the undergraduate students from the University who assisted in data collection. Special appreciation is extended to all the subjects who participated in the study.

Data Availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Yael Zaltz

References

Abdel-Latif, K. H. A., & Meister (2021). Speech recognition and listening effort in cochlear implant recipients and normal-hearing listeners. Frontiers in Neuroscience, 15, 725412. https://doi.org/10.3389/fnins.2021.725412

Ahissar & Hochstein (1997). Task difficulty and the specificity of perceptual learning. Nature, 387(6631), 401–406. https://doi.org/10.1038/387401a0

Ahissar & Hochstein (2004). The reverse hierarchy theory of visual perceptual learning. Trends in Cognitive Sciences, 8(10), 457–464. https://doi.org/10.1016/j.tics.2004.08.011

Ahissar, Nahum, Nelken, & Hochstein (2009). Reverse hierarchies and sensory learning. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 364(1515), 285–299. https://doi.org/10.1098/rstb.2008.0253

Amitay, Hawkey, D. J. C., & Moore, D. R. (2005). Auditory frequency discrimination learning is affected by stimulus variability. Perception & Psychophysics, 67(4), 691–698. https://doi.org/10.3758/BF03193525

Amitay, Zhang, Y.-X., Jones, P. R., & Moore, D. R. (2014). Perceptual learning: Top to bottom. Vision Research, 99, 69–77. https://doi.org/10.1016/j.visres.2013.11.006

ANSI/ASA S3.6-2018 - Specification for Audiometers (n.d.). Retrieved August 19, 2023, from https://webstore.ansi.org/standards/asa/ansiasas32018.

Banai & Amitay (2012). Stimulus uncertainty in auditory perceptual learning. Vision Research, 61, 83–88. https://doi.org/10.1016/j.visres.2012.01.009

Banai, Karawani, Lavie, & Lavner (2022). Rapid but specific perceptual learning partially explains individual differences in the recognition of challenging speech. Scientific Reports, 12(1), 10011. https://doi.org/10.1038/s41598-022-14189-8

Banai, Ortiz, J. A., Oppenheimer, J. D., & Wright, B. A. (2010). Learning two things at once: Differential constraints on the acquisition and consolidation of perceptual learning. Neuroscience, 165(2), 436–444. https://doi.org/10.1016/j.neuroscience.2009.10.060

Başkent & Gaudrain (2016). Musician advantage for speech-on-speech perception. The Journal of the Acoustical Society of America, 139(3), EL51–EL56. https://doi.org/10.1121/1.4942628

Biçer, Koelewijn, & Başkent (2023). Short implicit voice training affects listening effort during a voice cue sensitivity task with vocoder-degraded speech. Ear and Hearing, 44, 900–916. https://doi.org/10.1097/AUD.0000000000001335

Borrie, S. A., McAuliffe, M. J., & Liss, J. M. (2012). Perceptual learning of dysarthric speech: A review of experimental studies. Journal of Speech, Language, and Hearing Research, 55(1), 290–305. https://doi.org/10.1044/1092-4388(2011/10-0349)

Bradlow, A. R., Bassard, A. M., & Paller, K. A. (2023). Generalized perceptual adaptation to second-language speech: Variability, similarity, and intelligibility. The Journal of the Acoustical Society of America, 154(3), 1601–1613. https://doi.org/10.1121/10.0020914

Bronkhorst, A. W. (2015). The cocktail-party problem revisited: Early processing and selection of multi-talker speech. Attention, Perception & Psychophysics, 77(5), 1465–1487. https://doi.org/10.3758/s13414-015-0882-9

Brungart, D. S. (2001). Informational and energetic masking effects in the perception of two simultaneous talkers. The Journal of the Acoustical Society of America, 109(3), 1101–1109. https://doi.org/10.1121/1.1345696

Carlyon, R. P., & Shackleton, T. M. (1994). Comparing the fundamental frequencies of resolved and unresolved harmonics: Evidence for two pitch mechanisms? The Journal of the Acoustical Society of America, 95(6), 3541–3554. https://doi.org/10.1121/1.409971

Censor (2013). Generalization of perceptual and motor learning: A causal link with memory encoding and consolidation? Neuroscience, 250, 201–207. https://doi.org/10.1016/j.neuroscience.2013.06.062

Censor & Sagi (2009). Global resistance to local perceptual adaptation in texture discrimination. Vision Research, 49(21), 2550–2556. https://doi.org/10.1016/j.visres.2009.03.018

Ceyhan, Dere, H. H., & Mujdeci (2022). Evaluating the effectiveness of a new auditory training program on the speech recognition skills and auditory event-related potentials in elderly hearing aid users. Audiology & Neuro-Otology, 27(5), 368–376. https://doi.org/10.1159/000523807

Darwin, C. J., Brungart, D. S., & Simpson, B. D. (2003). Effects of fundamental frequency and vocal-tract length changes on attention to one of two simultaneous talkers. The Journal of the Acoustical Society of America, 114(5), 2913–2922. https://doi.org/10.1121/1.1616924

Davis, M. H., Johnsrude, I. S., Hervais-Adelman, Taylor, & McGettigan (2005). Lexical information drives perceptual learning of distorted speech: Evidence from the comprehension of noise-vocoded sentences. Journal of Experimental Psychology: General, 134(2), 222–241. https://doi.org/10.1037/0096-3445.134.2.222

Delhommeau, Micheyl, & Jouvent (2005). Generalization of frequency discrimination learning across frequencies and ears: Implications for underlying neural mechanisms in humans. Journal of the Association for Research in Otolaryngology, 6(2), 171–179. https://doi.org/10.1007/s10162-005-5055-4

Dimitrijevic, Pratt, & Starr (2013). Auditory cortical activity in normal hearing subjects to consonant vowels presented in quiet and in noise. Clinical Neurophysiology, 124(6), 1204–1215. https://doi.org/10.1016/j.clinph.2012.11.014

Dudai, Karni, & Born (2015). The consolidation and transformation of memory. Neuron, 88(1), 20–32. https://doi.org/10.1016/j.neuron.2015.09.004

Eckert, M. A., Teubner-Rhodes, & Vaden, K. I. (2016). Is listening in noise worth it? The neurobiology of speech recognition in challenging listening conditions. Ear and Hearing, 37(Suppl 1), 101S–110S. https://doi.org/10.1097/AUD.0000000000000300

El Boghdady, Başkent, & Gaudrain (2018). Effect of frequency mismatch and band partitioning on vocal tract length perception in vocoder simulations of cochlear implant processing. The Journal of the Acoustical Society of America, 143(6), 3505–3519. https://doi.org/10.1121/1.5041261

El Boghdady, Gaudrain, & Başkent (2019). Does good perception of vocal characteristics relate to better speech-on-speech intelligibility for cochlear implant users? The Journal of the Acoustical Society of America, 145(1), 417–439. https://doi.org/10.1121/1.5087693

Ezzatian, Pichora-Fuller, M. K., & Schneider, B. A. (2012). The effect of energetic and informational masking on the time-course of stream segregation: Evidence that streaming depends on vocal fine structure cues. Language and Cognitive Processes, 27(7–8), 1056–1088. https://doi.org/10.1080/01690965.2011.591934

Fant (1960). Acoustic theory of speech production.

Faul, Erdfelder, Lang, A. G., & Buchner (2007). G*power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175–191. https://doi.org/10.3758/bf03193146

Fu, Q.-J., Chinchilla, & Galvin, J. J. (2004). The role of spectral and temporal cues in voice gender discrimination by normal-hearing listeners and cochlear implant users. Journal of the Association for Research in Otolaryngology, 5(3), 253–260. https://doi.org/10.1007/s10162-004-4046-1

Gaudrain & Başkent (2018). Discrimination of voice pitch and vocal-tract length in cochlear implant users. Ear and Hearing, 39(2), 226–237. https://doi.org/10.1097/AUD.0000000000000480

Goldstone, R. L. (1998). Perceptual learning. Annual Review of Psychology, 49, 585–612. https://doi.org/10.1146/annurev.psych.49.1.585

Gordon-Salant, Yeni-Komshian, G. H., Fitzgibbons, P. J., & Schurman (2010). Short-term adaptation to accented English by younger and older adults. The Journal of the Acoustical Society of America, 128(4), EL200–4. https://doi.org/10.1121/1.3486199

Gustafson, S. J., Billings, C. J., Hornsby, B. W. Y., & Key, A. P. (2019). Effect of competing noise on cortical auditory evoked potentials elicited by speech sounds in 7- to 25-year-old listeners. Hearing Research, 373, 103–112. https://doi.org/10.1016/j.heares.2019.01.004

Han, J.-H., & Lee, H.-J. (2020). Noise-induced change of cortical temporal processing in cochlear implant users. Clinical and Experimental Otorhinolaryngology, 13(3), 241–248. https://doi.org/10.21053/ceo.2019.01081

Hauptmann & Karni (2002). From primed to learn: The saturation of repetition priming and the induction of long-term memory. Brain Research. Cognitive Brain Research, 13(3), 313–322. https://doi.org/10.1016/s0926-6410(01)00124-0

Hauptmann, Reinhart, Brandt, S. A., & Karni (2005). The predictive value of the leveling off of within session performance for procedural memory consolidation. Brain Research. Cognitive Brain Research, 24(2), 181–189. https://doi.org/10.1016/j.cogbrainres.2005.01.012

Henshaw & Ferguson, M. A. (2013). Efficacy of individual computer-based auditory training for people with hearing loss: A systematic review of the evidence. Plos One, 8(5), e62836. https://doi.org/10.1371/journal.pone.0062836

Henshaw, Heinrich, Tittle, & Ferguson (2022). Cogmed training does not generalize to real-world benefits for adult hearing aid users: Results of a blinded, active-controlled randomized trial. Ear and Hearing, 43(3), 741–763. https://doi.org/10.1097/AUD.0000000000001096

Herrmann & Johnsrude, I. S. (2020). A model of listening engagement (MoLE). Hearing Research, 397, 108016. https://doi.org/10.1016/j.heares.2020.108016

Herszage & Censor (2018). Modulation of learning and memory: A shared framework for interference and generalization. Neuroscience, 392, 270–280. https://doi.org/10.1016/j.neuroscience.2018.08.006

Hesseg, R. M., Gal, & Karni (2016). Not quite there: Skill consolidation in training by doing or observing. Learning & Memory, 23(5), 189–194. https://doi.org/10.1101/lm.041228.115

Holmes & Johnsrude, I. S. (2021). How long does it take for a voice to become familiar? Speech intelligibility and voice recognition are differentially sensitive to voice training. Psychological Science, 32(6), 903–915. https://doi.org/10.1177/0956797621991137

Irvine, D. R. F. (2018). Auditory perceptual learning and changes in the conceptualization of auditory cortex. Hearing Research, 366, 3–16. https://doi.org/10.1016/j.heares.2018.03.011

Jacoby & Ahissar (2015). Assessing the applied benefits of perceptual training: Lessons from studies of training working-memory. Journal of Vision, 15(10), 6. https://doi.org/10.1167/15.10.6

Jeter, P. E., Dosher, B. A., Liu, S.-H., & Lu, Z.-L. (2010). Specificity of perceptual learning increases with increased training. Vision Research, 50(19), 1928–1940. https://doi.org/10.1016/j.visres.2010.06.016

Kaplan-Neeman, Kishon-Rabin, Henkin, & Muchnik (2006). Identification of syllables in noise: Electrophysiological and behavioral correlates. The Journal of the Acoustical Society of America, 120(2), 926–933. https://doi.org/10.1121/1.2217567

Karni (1996). The acquisition of perceptual and motor skills: A memory system in the adult human cortex. Brain Research. Cognitive Brain Research, 5(1–2), 39–48. https://doi.org/10.1016/s0926-6410(96)00039-0

Karni & Sagi (1991). Where practice makes perfect in texture discrimination: Evidence for primary visual cortex plasticity. Proceedings of the National Academy of Sciences of the United States of America, 88(11), 4966–4970. https://doi.org/10.1073/pnas.88.11.4966

Karni & Sagi (1993). The time course of learning a visual skill. Nature, 365(6443), 250–252. https://doi.org/10.1038/365250a0

Kishon-Rabin, Patael, Menahemi, & Amir (2004). Are the perceptual effects of spectral smearing influenced by speaker gender? Journal of Basic and Clinical Physiology and Pharmacology, 15(1–2), 41–55. https://doi.org/10.1515/jbcpp.2004.15.1-2.41

Kishon-Rabin & Zaltz (2023). The effect of noise on the utilization of fundamental frequency and formants for voice discrimination in children and adults. Applied Sciences, 13(19), 10752. https://doi.org/10.3390/app131910752

Koelewijn, Gaudrain, Shehab, Treczoks, & Başkent (2023). The role of word content, sentence information, and vocoding for voice cue perception. Journal of Speech, Language, and Hearing Research, 66(9), 3665–3676. https://doi.org/10.1044/2023_JSLHR-22-00491

Koelewijn, Gaudrain, Tamati, & Başkent (2021). The effects of lexical content, acoustic and linguistic variability, and vocoding on voice cue perception. The Journal of the Acoustical Society of America, 150(3), 1620–1634. https://doi.org/10.1121/10.0005938

Korman, Raz, Flash, & Karni (2003). Multiple shifts in the representation of a motor sequence during the acquisition of skilled performance. Proceedings of the National Academy of Sciences of the United States of America, 100(21), 12492–12497. https://doi.org/10.1073/pnas.2035019100

Kozou, Kujala, Shtyrov, Toppila, Starck, Alku, & Näätänen (2005). The effect of different noise types on the speech and non-speech elicited mismatch negativity. Hearing Research, 199(1–2), 31–39. https://doi.org/10.1016/j.heares.2004.07.010

Kraus, McGee, Carrell, T. D., King, Tremblay, & Nicol (1995). Central auditory system plasticity associated with speech discrimination training. Journal of Cognitive Neuroscience, 7(1), 25–32. https://doi.org/10.1162/jocn.1995.7.1.25

Kreitewolf, Mathias, S. R., & von Kriegstein (2017). Implicit talker training improves comprehension of auditory speech in noise. Frontiers in Psychology, 8, 1584. https://doi.org/10.3389/fpsyg.2017.01584

Lau, B. K., Ruggles, D. R., Katyal, Engel, S. A., & Oxenham, A. J. (2017). Sustained cortical and subcortical measures of auditory and visual plasticity following short-term perceptual learning. Plos One, 12(1), e0168858. https://doi.org/10.1371/journal.pone.0168858

Lawrence, B. J., Jayakody, D. M. P., Henshaw, Ferguson, M. A., Eikelboom, R. H., Loftus, A. M., & Friedland, P. L. (2018). Auditory and cognitive training for cognition in adults with hearing loss: A systematic review and meta-analysis. Trends in Hearing, 22, 2331216518792096. https://doi.org/10.1177/2331216518792096

63. Levin, Balberg, Zaltz (2022). Cortical activation in response to speech differs between prelingually deafened cochlear implant users with good or poor speech-in-noise understanding: An fNIRS study. Applied Sciences, 12(23), 12063. https://doi.org/10.3390/app122312063

64. Levin, Zaltz (2023). Voice discrimination in quiet and in background noise by simulated and real cochlear implant users. Journal of Speech, Language, and Hearing Research, 66, 1–18. https://doi.org/10.1044/2023_JSLHR-23-00019

65. Levitt (1971). Transformed up-down methods in psychoacoustics. The Journal of the Acoustical Society of America, 49(2), 467–477. https://doi.org/10.1121/1.1912375

66. Lieberman, Blumstein S. E. (1988). Source-filter theory of speech production. In Speech physiology, speech perception, and acoustic phonetics (Cambridge Studies in Speech Science and Communication, pp. 34–50). Cambridge University Press.

67. Liu, Kewley-Port (2004). Vowel formant discrimination for high-fidelity speech. The Journal of the Acoustical Society of America, 116(2), 1224–1233. https://doi.org/10.1121/1.1768958

68. Mackersie C. L., Dewey, Guthrie L. A. (2011). Effects of fundamental frequency and vocal-tract length cues on sentence segregation by listeners with hearing loss. The Journal of the Acoustical Society of America, 130(2), 1006–1019. https://doi.org/10.1121/1.3605548

69. Martin B. A., Kurtzberg, Stapells D. R. (1999). The effects of decreased audibility produced by high-pass noise masking on N1 and the mismatch negativity to speech sounds /ba/ and /da/. Journal of Speech, Language, and Hearing Research, 42(2), 271–286. https://doi.org/10.1044/jslhr.4202.271

70. Menning, Roberts L. E., Pantev (2000). Plastic changes in the auditory cortex induced by intensive frequency discrimination training. Neuroreport, 11(4), 817–822. https://doi.org/10.1097/00001756-200003200-00032

71. Micheyl, Delhommeau, Perrot, Oxenham A. J. (2006). Influence of musical and psychoacoustical training on pitch discrimination. Hearing Research, 219(1–2), 36–47. https://doi.org/10.1016/j.heares.2006.05.004

72. Molloy, Moore D. R., Sohoglu, Amitay (2012). Less is more: Latent learning is maximized by shorter training sessions in auditory perceptual learning. PLOS ONE, 7(5), e36929. https://doi.org/10.1371/journal.pone.0036929

73. Moulines, Charpentier (1990). Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 9(5–6), 453–467. https://doi.org/10.1016/0167-6393(90)90021-Z

74. Muller-Gass, Marcoux, Logan, Campbell K. B. (2001). The intensity of masking noise affects the mismatch negativity to speech sounds in human subjects. Neuroscience Letters, 299(3), 197–200. https://doi.org/10.1016/s0304-3940(01)01508-7

75. Ning, Wright B. A. (2023). Evidence that anterograde learning interference depends on the stage of learning of the interferer: Blocked versus interleaved training. Learning & Memory, 30(5–6), 101–109. https://doi.org/10.1101/lm.053710.122

76. Nygaard L. C., Pisoni D. B. (1998). Talker-specific learning in speech perception. Perception & Psychophysics, 60(3), 355–376. https://doi.org/10.3758/BF03206860

77. Nygaard L. C., Sommers M. S., Pisoni D. B. (1994). Speech perception as a talker-contingent process. Psychological Science, 5(1), 42–46. https://doi.org/10.1111/j.1467-9280.1994.tb00612.x

78. Oxenham A. J. (2008). Pitch perception and auditory stream segregation: Implications for hearing loss and cochlear implants. Trends in Amplification, 12(4), 316–331. https://doi.org/10.1177/1084713808325881

79. Pinsard, Boutin, Gabitov, Lungu, Benali, Doyon (2019). Consolidation alters motor sequence-specific distributed representations. eLife, 8. https://doi.org/10.7554/eLife.39324

80. Roth D. A.-E., Appelbaum, Milo, Kishon-Rabin (2008). Generalization to untrained conditions following training with identical stimuli. Journal of Basic and Clinical Physiology and Pharmacology, 19(3–4), 223–236. https://doi.org/10.1515/jbcpp.2008.19.3-4.223

81. Schvartz-Leyzac K. C., Chatterjee (2015). Fundamental-frequency discrimination using noise-band-vocoded harmonic complexes in older listeners with normal hearing. The Journal of the Acoustical Society of America, 138(3), 1687–1695. https://doi.org/10.1121/1.4929938

82. Shultz (2015). When your voice betrays you. Science, 347(6221), 494. https://doi.org/10.1126/science.347.6221.494

83. Simons D. J., Boot W. R., Charness, Gathercole S. E., Chabris C. F., Hambrick D. Z., Stine-Morrow E. A. L. (2016). Do “brain-training” programs work? Psychological Science in the Public Interest, 17(3), 103–186. https://doi.org/10.1177/1529100616661983

84. Skuk V. G., Schweinberger S. R. (2013). Gender differences in familiar voice identification. Hearing Research, 296, 131–140. https://doi.org/10.1016/j.heares.2012.11.004

85. Smith D. R. R., Patterson R. D. (2005). The interaction of glottal-pulse rate and vocal-tract length in judgements of speaker size, sex, and age. The Journal of the Acoustical Society of America, 118(5), 3177–3186. https://doi.org/10.1121/1.2047107

86. Smith D. R. R., Walters T. C., Patterson R. D. (2007). Discrimination of speaker sex and size when glottal-pulse rate and vocal-tract length are controlled. The Journal of the Acoustical Society of America, 122(6), 3628–3639. https://doi.org/10.1121/1.2799507

87. Sobon K. A., Taleb N. M., Buss, Grose J. H., Calandruccio (2019). Psychometric function slope for speech-in-noise and speech-in-speech: Effects of development and aging. The Journal of the Acoustical Society of America, 145(4), EL284–EL290. https://doi.org/10.1121/1.5097377

88. Steinmetzger, Rosen (2015). The role of periodicity in perceiving speech in quiet and in background noise. The Journal of the Acoustical Society of America, 138(6), 3586–3599. https://doi.org/10.1121/1.4936945

89. Stelmachowicz P. G., Lewis D. E., Kelly W. J., Jesteadt (1990). Speech perception in low-pass filtered noise for normal and hearing-impaired listeners. Journal of Speech and Hearing Research, 33(2), 290–297. https://doi.org/10.1044/jshr.3302.290

90. Strori, Bradlow A. R., Souza P. E. (2020). Recognition of foreign-accented speech in noise: The interplay between talker intelligibility and linguistic structure. The Journal of the Acoustical Society of America, 147(6), 3765–3782. https://doi.org/10.1121/10.0001194

91. Swanepoel, Oosthuizen D. J. J., Hanekom J. J. (2012). The relative importance of spectral cues for vowel recognition in severe noise. The Journal of the Acoustical Society of America, 132(4), 2652–2662. https://doi.org/10.1121/1.4751543

92. Tzeng C. Y., Russell M. L., Nygaard L. C. (2024). Attention modulates perceptual learning of non-native-accented speech. Attention, Perception & Psychophysics, 86(1), 339–353. https://doi.org/10.3758/s13414-023-02790-6

93. Van Wilderode, Van Humbeeck, Krampe R. T., van Wieringen (2023). Toward a listening training paradigm: Evaluation in normal-hearing young and middle-aged adults. Ear and Hearing, 44(5), 1229–1239. https://doi.org/10.1097/AUD.0000000000001367

94. Vestergaard M. D., Fyson N. R. C., Patterson R. D. (2009). The interaction of vocal characteristics and audibility in the recognition of concurrent syllables. The Journal of the Acoustical Society of America, 125(2), 1114–1124. https://doi.org/10.1121/1.3050321

95. Vestergaard M. D., Fyson N. R. C., Patterson R. D. (2011). The mutual roles of temporal glimpsing and vocal characteristics in cocktail-party listening. The Journal of the Acoustical Society of America, 130(1), 429–439. https://doi.org/10.1121/1.3596462

96. Wright B. A., Sabin A. T. (2007). Perceptual learning: How much daily training is enough? Experimental Brain Research, 180(4), 727–736. https://doi.org/10.1007/s00221-007-0898-z

97. Wright B. A., Zhang (2009). A review of the generalization of auditory learning. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 364(1515), 301–311. https://doi.org/10.1098/rstb.2008.0262

98. Pfingst B. E. (2008). Spectral and temporal cues for speech recognition: Implications for auditory prostheses. Hearing Research, 242(1–2), 132–140. https://doi.org/10.1016/j.heares.2007.12.010

99. Yonan C. A., Sommers M. S. (2000). The effects of talker familiarity on spoken word identification in younger and older listeners. Psychology and Aging, 15(1), 88–99. https://doi.org/10.1037//0882-7974.15.1.88

100. Zaltz (2023). The effect of stimulus type and testing method on talker discrimination of school-age children. The Journal of the Acoustical Society of America, 153(5), 2611. https://doi.org/10.1121/10.0017999

101. Zaltz, Ari-Even Roth, Karni, Kishon-Rabin (2018). Long-term training-induced gains of an auditory skill in school-age children as compared with adults. Trends in Hearing, 22, 2331216518790902. https://doi.org/10.1177/2331216518790902

102. Zaltz, Goldsworthy R. L., Eisenberg L. S., Kishon-Rabin (2020). Children with normal hearing are efficient users of fundamental frequency and vocal tract length cues for voice discrimination. Ear and Hearing, 41(1), 182–193. https://doi.org/10.1097/AUD.0000000000000743

103. Zaltz, Kishon-Rabin (2022). Difficulties experienced by older listeners in utilizing voice cues for speaker discrimination. Frontiers in Psychology, 13, 797422. https://doi.org/10.3389/fpsyg.2022.797422

The Impact of Trained Conditions on the Generalization of Learning Gains Following Voice Discrimination Training
