Abstract
Listeners adjust their perception to match that of presented speech through shifting and relaxation of categorical boundaries. This allows for processing of speech variation, but may be detrimental to processing efficiency. Bilingual children are exposed to many types of speech in their linguistic environment, including native and non-native speech. This study examined how first language (L1) Spanish/second language (L2) English bilingual children shifted and relaxed phoneme categorization along the cue of voice onset time (VOT) during English speech processing after three types of language exposure: native English exposure, native Spanish exposure, and Spanish-accented English exposure. After exposure to Spanish-accented English speech, bilingual children shifted categorical boundaries in the direction of native English speech boundaries. After exposure to native Spanish speech, children shifted to a smaller extent in the same direction and relaxed boundaries leading to weaker differentiation between categories. These results suggest that prior exposure can affect processing of a second language in bilingual children, but different mechanisms are used when adapting to different types of speech variation.
1 Introduction
In the United States, many bilingual children acquire their first language (L1) at home and begin hearing their second language (L2) regularly upon school entry. In addition, bilingual children are often exposed to both native and non-native speakers of their languages (Place & Hoff, 2016). In the long term, language input shapes the developing phonological systems of the bilingual child, but the exact mechanisms that children use to adapt to variability in speech are not well documented. This study aims to examine the real-time, fine-grained changes that take place in a child’s phonological system as a result of various types of common language exposure in bilingual environments.
1.1 Adapting to speech variability
Listeners comprehending unfamiliar speech, such as foreign-accented speech, tend to do so more slowly and less accurately than when comprehending speech they are familiar with (Adank et al., 2009; Atagi & Bent, 2015; Bent, 2014, 2015; Creel et al., 2016; Floccia et al., 2009; Hanulíková & Weber, 2012; Holt & Bent, 2017; Munro & Derwing, 1995; Nathan et al., 1998; Newton & Ridgway, 2016). Yet, adaptation to variability in the speech input allows listeners to improve in their comprehension of unfamiliar speech even after a small amount of input (Bradlow & Bent, 2008; Clarke & Garrett, 2004; Maye et al., 2008; Schmale et al., 2012; White & Aslin, 2011; Witteman et al., 2013). This ability develops throughout childhood, with 8- to 9-year olds outperforming 5- to 6-year olds; however, even 15-year olds do not yet perform at adult-like levels (Bent, 2018; Bent & Atagi, 2015). The mechanism by which listeners respond to variability in the speech stream has many names, such as perceptual learning, adaptation, recalibration, and retuning (Dahan et al., 2008; Eisner & McQueen, 2005, 2006; Kraljic & Samuel, 2005, 2006, 2007; McQueen et al., 2006, 2012; Norris et al., 2003; Reinisch et al., 2013; Reinisch & Holt, 2014). In this work, we use the term “perceptual retuning” to broadly refer to changes in perceptual categorization including shifting and relaxing of phoneme categories.
Research using perceptual learning paradigms demonstrates that listeners can use lexical context to retune to unfamiliar speech (Norris et al., 2003). This phenomenon is known as lexically guided retuning. For example, if an atypical sound [?] is presented instead of [f] as the last sound of
A similar study with [f] and [s] perception was conducted in children, but instead of an exposure story, ambiguous sounds were presented in single word contexts (McQueen et al., 2012). The study revealed that 6-year olds, 12-year olds, and adults all adapted to perceiving the ambiguous sounds according to how the speaker had used them. Therefore, lexical context can provide children and adults with the necessary information to determine what an ambiguous fricative is most analogous to in their own systems. This retuning of phoneme categories has been proposed to be one of the mechanisms that helps listeners adapt to variability encountered in the speech stream, such as accented speech (Reinisch & Holt, 2014; Zheng & Samuel, 2020).
1.2 Generalizing retuning
While the ability to adapt to speech variability has been widely examined, the persistence of such effects, both duration of the effect and whether it generalizes to other sounds, has not been examined as extensively. In a study which examined
Examinations of speaker specificity in perceptual retuning and generalization suggest that retuning generalizes differently for different kinds of phonemes. First, it was found that retuning using stops was generalizable to a new speaker (Kraljic & Samuel, 2005), while fricatives retuned only when the exposure and test speaker were the same (Kraljic & Samuel, 2007). It was hypothesized that speaker-specific frequency cues in fricatives make it easier to identify speaker-specific productions, but Reinisch and Holt (2014) further showed that if the fricatives are produced within a similar acoustic range, they could lead to cross-speaker retuning, even from female to male speakers. Reinisch and Holt (2014) suggested that this may be the reason why exposure to multiple talkers facilitates adaptation to accented speech more so than exposure to a single talker. Multiple speakers are more likely to cover a wider acoustic range, increasing the likelihood of encountering phonemes in a similar acoustic range of a novel test speaker.
Perceptual retuning may be useful when processing an ambiguous sound with supporting lexical context, but may be more difficult to implement if the ambiguous sound is actually a valid phoneme for the listener, as could be the case with bilingual listeners. Examination of perceptual retuning in bilinguals is complex because the phonetic realization of many phonemes is not the same for bilinguals as for their monolingual counterparts in each of their languages (e.g., Flege & Eefting, 1987). Both languages of early bilinguals tend to influence each other. For example, in English and Spanish, the phoneme categories /d/ and /t/ can be differentiated along the cue of voice onset time (VOT). Flege and Eefting (1987) found that the perceptual category boundary between /d/ and /t/ was around 43 ms for English monolingual adults, around 23 ms for Spanish monolinguals, and around 27 ms for early Spanish–English bilingual speakers, indicating that the bilinguals’ category boundary fell between the prototypical values of their two languages. Similarly, in production, Spanish–English bilinguals produce stop-initial words with VOTs ranging from prototypical English to prototypical Spanish values. The VOT values of an individual are affected by linguistic variables such as the age of L2 acquisition (Thornburgh & Ryalls, 1998) and correlate with the strength of their foreign accent (Major, 1987). Therefore, bilingual speech adaptation differs from monolingual speech adaptation because the linguistic systems differ even before adaptation has taken place.
There is evidence that bilingual adults can retune phoneme categories in both their L1 and L2, and in response to exposure to both native and accented speech (Cooper & Bradlow, 2018; Cutler et al., 2018; Reinisch et al., 2013; Schertz et al., 2016; Weber et al., 2014). Furthermore, retuning across languages is also possible. In a study with Dutch native/English L2 listeners, Reinisch et al. (2013) showed that after exposure to Dutch-accented English (their L2), listeners retuned categories in Dutch (their L1). However, little is known about how flexible perceptual retuning is in bilingual
The current study examined how Spanish–English bilingual children respond to exposure to native English, native Spanish, and Spanish-accented English speech. We investigated this question within the framework provided by two hypothesized mechanisms to account for responses to speech variability.
1.3 Mechanisms to account for speech variability
A visual depiction of the two hypothesized mechanisms for responding to speech variability is found in Figure 1. The first,

Demonstration of boundary shifting (left) and criteria relaxing (right). The solid curve represents pre-retuning categorization of a stop continuum based on voice onset time (VOT). The dotted curves represent how categorization may change after retuning. The point where the curves cross the 50% vertical line represents the category boundary.
The other method for responding to speech variability, known as
Both mechanisms could be applied when using lexical context to guide retuning. A crucial element of lexically guided retuning is the presence of a lexical item to guide the top–down retuning process. Without the lexical context, the listener would have no cue as to how to shift or whether to relax their categorization. However, there are indications that the lexical context is not always necessary for perceptual changes to take place. Instead, bottom–up perceptual learning is possible through distributional or statistical learning. This has been a driving force behind explanations of infants’ acquisition of native phoneme perception before exhibiting detailed knowledge of words (Maye et al., 2002 but see Cristia, 2018) and it is an ability still present in adults (Maye & Gerken, 2000). Spanish–English bilingual children, who are the focus of the present study, are well-suited to test the role of lexically guided retuning, because they hear similar distributions of phonemes
1.4 Current study
Linking perceptual learning to accent accommodation has been attempted with adults, but it is not clear whether and how children adapt to accented speech. This study aimed to examine how phoneme categorization in school-aged children who acquired Spanish before English was affected by the common types of speech exposure they receive in their environment: native English, native Spanish, and Spanish-accented English speech. Exposure was provided through listening to short clips of stories read by a speaker of one of the three types of speech. At testing, children’s perception of phoneme categories produced by a new native English speaker was examined. We were specifically interested in how perception of English plosives along the cue of VOT would be affected by each of the three exposure types. This was tested by having children classify plosive minimal pairs (e.g., gold–cold) from a continuum of VOTs across typical English ranges (e.g., 10–60 ms) after exposure to each type of speech. The English exposure condition served as a baseline for comparison to the other two exposure conditions. Crucially, both the native Spanish and Spanish-accented English speech conditions provided exposure to phonemes with Spanish-like VOT distributions; however, only the Spanish-accented English condition provided exposure in the context of English lexical items which may allow for the use of lexically guided retuning. Without lexical context to give the listener a hint to which phoneme a sound was meant to be, changes in categorization may not take place.
Two possible mechanisms behind accent adaptation were examined: boundary shifting and category relaxation. We predicted that if lexically guided retuning was used, then exposure to Spanish-accented English would cause a shift to a lower, more Spanish-like category boundary and more boundary relaxation as compared with exposure to native English speech. Under lexically guided retuning, native Spanish exposure should not lead to any shifting or relaxing, as the lexical context at exposure is not English. However, if retuning occurs from the bottom up, then we may find shifts to more Spanish-like boundaries and boundary relaxation after exposure to both Spanish-accented and Spanish speech. Importantly, the experimental design provides a strict test of retuning such that any adaptation to the exposure speaker would have to be generalized to the new test speaker. Therefore, it is also possible that cross-speaker differences would lead to no changes regardless of exposure type.
2 Method
2.1 Participants
Sixty-eight participants between ages 6 and 9 years who were exposed to Spanish from birth and English as an L2 participated in this experiment as part of a larger study. At the time of testing, children were growing up in an English-dominant, metropolitan area in the Midwest US. Exclusionary criteria included exposure to English from birth (3 participants excluded), extensive exposure to a third language (defined as more than 5% weekly exposure), parental concern or a formal diagnosis of a language disorder, failure to pass a hearing screening, standardized receptive vocabulary scores in both languages more than 1.5
Participant Characteristics.
Standard Score of Kaufman Brief Intelligence Test, second Edition (KBIT-2), Matrices subtest.
Exposure during waking hours in a typical week at time of testing.
For English, specifically exposure to a Spanish-accented English speaker. For Spanish, specifically exposure to an English-accented Spanish speaker.
For English, standard score of Peabody Picture Vocabulary Test, fourth edition (PPVT-4). For Spanish, standard score of Test de Vocabulario en Imágenes Peabody (TVIP).
For English, standard score of Woodcock Johnson III (WJ III) Picture Vocabulary subtest. For Spanish, standard score of Woodcock Muñoz III (WM III) Vocabulario sobre dibujos subtest.
For English, scaled score of nonword repetition subtest of Comprehensive Test of Phonological Processing, second edition (CTOPP-2). For Spanish, raw score of nonword repetition subtest of Test of Phonological Processing in Spanish (TOPPS).
Children were administered standardized language tests in both English and Spanish. Expressive vocabulary in English was assessed with the
Fifty-five caregivers provided answers to the background questionnaire related to schooling and language acquisition. Of these families, 62% reported that their child had been enrolled when younger in Head Start, a program that provides early education services for families falling below federal poverty guidelines. Almost all families reported children receiving their first English exposure in an educational setting. At the time of testing, 65% of children were attending Spanish–English dual immersion schools and the majority of them received 50% English and 50% Spanish exposure in school. Overall, the children had more weekly exposure to Spanish than English. When exposure was further categorized into native and accented speech in the two languages, native Spanish was the most common type of speech heard, followed by native English, then Spanish-accented English, and finally English-accented Spanish. Individual level data along with supplemental material and detailed explanations of analyses are available in the OSF repository: osf.io/bu8p4.
2.2 Stimuli
2.2.1 Exposure stories
Two child-friendly stories, each with three parts, were constructed in English and then professionally translated into Spanish. These stories were used to provide participants with exposure to the conditions of interest and to stop-initial words. Participants only heard one of the two stories. Each part of the stories was designed to include at least a set of 18 stop-initial, early acquired nouns. These 18 words were placed in positions that would ensure that they would be produced as stop-initial words in both languages. In Spanish, it is common for stop-initial words in specific positions (e.g., after vowels or fricatives) to be spirantized such that there is no full closure and VOT cannot be measured (Fabiano-Smith et al., 2015). The careful placing of the 18 words after nasals or after a pause (such as in a list) ensured that the child would receive at least 3 exposures to each of the targeted phonemes (i.e., /p, t, k, b, d, g/) in each condition; in practice, they received many more exposures to stop-initial words that naturally occurred throughout the story. The 18 words were chosen because they were stop-initial in both English and Spanish (e.g., /d/ from dog in English but /p/ from perro in Spanish), but they were not cognates, nor did they start orthographically with the same letter (e.g., gloves in English and guantes in Spanish was not part of the set). In both languages, the set of 18 words included an equal number of words from each extreme of the VOT continuum (9 voiced, 9 voiceless) and an equal number from each place of articulation (6 bilabial, 6 alveolar, 6 velar).
In addition to the 18 words in positions where they must be produced with full closures, in each English part of the story there were on average 52 other words that could potentially be produced as voiced stop-initial words and 45 potential voiceless stop-initial words. In Spanish, there were 71 potential voiced-stop initial words and 113 potential voiceless-stop initial words (and the high number of Spanish voiceless-stop initial words was in part due to 15% of the words in this set being “que”). Each part of the two stories had on average 523 words in English (range: 517–528) and 526 words in Spanish (range: 513–546). The full list of chosen 18 stop-initial words (which differed between stories) and full stories in English and Spanish are available in the OSF repository.
Stories were normed on a set of 8 native English-speaking adult listeners. They rated each part of the story for how interesting it was on a scale of 0 (
All stories were recorded in Audacity (Audacity Team, 2018) by females in their thirties in a sound attenuated booth using a Logitech USB microphone. This age was targeted to parallel the maternal input that school-aged children receive. Speakers were all residing in the Midwest US at the time of recording. From the set of story recordings gathered, which included 2 native English speakers and 3 native Spanish speakers, who also recorded Spanish-accented English stories, one speaker was chosen for each exposure category: a native English speaker, a native Spanish speaker, and a different Spanish-accented English speaker. Speaker characteristics of the 3 chosen speakers are available in Table 2. Following recording, the story parts were edited to remove any long pauses and non-speech sounds and to normalize the intensity to the same level. Stop-initial words were further edited to ensure that for the English speaker, the stop sounds had a VOT that fell within the ranges of English (i.e., voiced: 0–30 ms; voiceless: 70–100 ms) and that for the accented and Spanish speakers, they fell within the ranges of Spanish (i.e., voiced: −100 to 0 ms; voiceless: 0–30 ms). Although this range is on the extreme end for a Spanish-accented English speaker, the values do represent natural variability, especially of a speaker with a strong accent (Thornburgh & Ryalls, 1998). Manipulation was implemented for short and long lag stops by either cutting or copying and inserting short portions of the recording between the burst release and voicing onset. Where necessary, prevoicing was added by inserting an instance of prevoicing from a different instance of a matching phoneme of the same speaker.
Speaker Characteristics.
Speaking ability was self-rated on a scale of 0 (
Foreign accent was self-rated on a scale of 0 (
Stop-initial words that were approximated naturally by the speaker were not manipulated. This included 2% of voiced-stop initial words produced by the native English speaker, 5% of voiced-stop initial words produced by the accented speaker, and 57% of voiced-stop initial words produced by the Spanish speaker. The natural VOTs for each phoneme produced by each speaker before manipulation are available in Table 3. Values were adjusted to yield an absolute average change (i.e., measuring both lengthening and shortening with positive values) of 11.6 ms for the native English speaker, 63.6 ms for the accented speaker, and 19.6 ms for the Spanish speaker. For the accented speaker, the largest absolute changes were made by the addition of prevoicing; however, both voicing categories for the accented speaker’s productions underwent significant changes. The average VOT of each phoneme by each speaker after manipulation is available in Table 4. All other aspects of production associated with native and accented speech were allowed to vary naturally based on how the speakers normally read child-directed stories.
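The "absolute average change" reported above can be illustrated with a short sketch. This is only an illustration of the arithmetic, written in Python rather than the R used for the study's analyses; the function name and the VOT values below are hypothetical and are not taken from Tables 3 or 4.

```python
def mean_absolute_change(before_ms, after_ms):
    """Average size of the VOT adjustment, ignoring its direction."""
    if len(before_ms) != len(after_ms) or not before_ms:
        raise ValueError("need two non-empty lists of equal length")
    # Lengthening and shortening both count as positive changes.
    return sum(abs(a - b) for b, a in zip(before_ms, after_ms)) / len(before_ms)


# Hypothetical VOTs (ms) for four tokens; negative values denote prevoicing.
before = [15.0, 85.0, 5.0, 40.0]
after = [20.0, 75.0, -55.0, 25.0]

change = mean_absolute_change(before, after)
print(change)  # 22.5
```

In this made-up example, the single token that receives added prevoicing (5 ms to −55 ms) accounts for most of the average, which mirrors the observation that prevoicing produced the largest absolute changes for the accented speaker.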
VOT (ms) of Stops Produced by Each Speaker and Phoneme Before Manipulation of the Stories.
VOT (ms) of Stops Produced by Each Speaker and Phoneme After Manipulation of the Stories.
Children always heard the first part of the story in native English. This was done to ensure a similar starting point in exposure for all children: language exposure prior to visiting the lab could not be controlled, and starting with the same exposure condition allowed for some degree of control. The order of the second and third parts was counterbalanced between participants such that they were presented either in native Spanish second and Spanish-accented English third, or vice versa. The native English speaker read the first part of each story with an average duration of 2.6 minutes or speech rate of 4.7 syllables/second (3.3 words/second). The Spanish-native speaker had an average reading duration per story part of 3.9 minutes or a speech rate of 4.1 syllables/second (2.2 words/second). Finally, the Spanish-accented English speaker had an average reading duration of 3.7 minutes per story part or a speech rate of 3.3 syllables/second (2.34 words/second).
2.2.2 Minimal pair VOT continua
Three stop-initial minimal pair continua were created to test perceptual categorization. The targeted VOT range was 0–60 ms since this includes the range in which English category boundaries usually fall for stop sounds by Spanish-English bilingual and monolingual English children (Williams, 1979). One continuum was created for each of the three English places of stop articulation:
The minimal pairs were recorded by a native English female speaker in her twenties in a sound attenuated booth. The recordings were manipulated in Praat using a script to create a VOT continuum (Winn, 2020), which included a word at every 10-ms increment. Because of restrictions associated with children’s attention, only 6 steps of each continuum were used in testing. The category boundary of bilabial stops is generally at a shorter VOT than other stops and velar stops tend to have category boundaries at higher VOTs. In order to accurately capture the most ambiguous VOT steps for each type of stop continuum, the
2.3 Procedure
The experimental task involved a two-alternative forced-choice paradigm. Children sat in front of a Tobii Pro T60XL eye-tracker and completed 5-point calibration. The eye-tracker was used to trigger the beginning of trials to ensure children were paying attention, and not to examine eye-gaze data. In front of children was a button box used to record responses. Children were first familiarized with the six pictures used to represent the minimal pairs of each of the 3 VOT continua (
During the exposure stories and the testing blocks, the experimenter sat behind participants and interacted as little as possible with them while keeping them focused on the task. When intervention was necessary to assure that children were attentive, the experimenter provided nonverbal cues such as pointing to the screen or tapping them on the shoulder to encourage them to sit up. This was done in order to not give the children any exposure to a voice other than the exposure story speaker and test item speaker.
During testing blocks, the timing of the trial included 500 ms of the 2 pictures being presented before the fixation video appeared. The fixation video disappeared when the child made a fixation of at least 100 ms to the video or after 2,500 ms passed without a fixation to the center of the screen. The fixation video disappeared as the target word played, and the two images disappeared when the child made a button push or if 5,000 ms passed without an answer. There was a 200 ms interstimulus interval between trials. The location of the two images was counterbalanced across trials (e.g.,
2.4 Analyses
All data cleaning steps and analyses are available in the OSF repository. The dataset was first assessed for any children that did not understand the task or were answering at random. In order to exclude these children, the data were checked for those who selected the same item for more than 90% of trials (i.e., always chose the
A total of 13,484 trials were available for the final analysis. Three separate analyses were conducted to examine how exposure affected phoneme categorization. The first analysis examined boundary shifting on the VOT continuum, the second examined the change in slope of the category boundary, and the third involved categorization of the words at the extremes of the VOT continua. The second two analyses indexed criteria relaxing. Post hoc analyses to examine effects of block order were run as well. Analyses were run in R using the lme4 (v1.1.26; Bates et al., 2015), afex (v0.28.1; Singmann et al., 2017), car (v3.0.10; Fox & Weisberg, 2019), emmeans (v1.5.4; Lenth, 2019), lattice (v0.20.41; Sarkar, 2008), HLMdiag (v0.4.0; Loy & Hofmann, 2014), MuMIn (v1.43.17; Bartoń, 2020), and DHARMa (v0.4.1; Hartig, 2021) packages.
2.4.1 Category boundary shifting
The category boundary between the pair of phonemes for each continuum was calculated for each participant in each exposure condition. Each boundary was calculated using the maximum available trials from the 30 possible trials in each exposure condition (5 repetitions of 6-step continuum) based on a logistic regression model. The logistic regression model created a curve similar to that in Figure 1 and was represented by an intercept (α) and beta value (β) which can be entered into the logistic equation
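The boundary calculation described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the study's R code. It assumes the standard two-parameter logistic form, P(voiceless) = 1 / (1 + exp(−(α + β·VOT))), so that the category boundary (the 50% crossover) is −α/β. The function names, the Newton-Raphson fitting routine, and the response counts are all assumptions made for the example.

```python
import math


def fit_logistic(vot_ms, chose_voiceless, max_iter=50, tol=1e-10):
    """Fit P(voiceless) = 1 / (1 + exp(-(alpha + beta * vot))) by Newton-Raphson."""
    mean_vot = sum(vot_ms) / len(vot_ms)
    x = [v - mean_vot for v in vot_ms]  # centre VOT for numerical stability
    a = b = 0.0
    for _ in range(max_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, yi in zip(x, chose_voiceless):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g_a += yi - p          # gradient of the log-likelihood
            g_b += (yi - p) * xi
            h_aa += w              # entries of the negative Hessian
            h_ab += w * xi
            h_bb += w * xi * xi
        det = h_aa * h_bb - h_ab * h_ab
        step_a = (h_bb * g_a - h_ab * g_b) / det
        step_b = (h_aa * g_b - h_ab * g_a) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return a - b * mean_vot, b  # undo the centring: (alpha, beta)


def category_boundary(alpha, beta):
    """VOT (ms) where both categories are equally likely: alpha + beta * vot = 0."""
    return -alpha / beta


# Hypothetical data for one child in one condition: a 6-step continuum with
# 5 repetitions per step, and the number of "voiceless" responses at each step.
steps = [10, 20, 30, 40, 50, 60]
voiceless_counts = [0, 1, 2, 3, 4, 5]
vot, responses = [], []
for step, k in zip(steps, voiceless_counts):
    vot += [step] * 5
    responses += [1] * k + [0] * (5 - k)

alpha, beta = fit_logistic(vot, responses)
boundary = category_boundary(alpha, beta)
print(round(boundary, 2))  # 35.0 (the made-up counts are symmetric around 35 ms)
```

With real data, a child who responds identically at every step would yield a flat or undefined fit, which is one reason boundary values need to be screened for plausibility before modelling.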
A linear mixed effect model was created with category boundary as the outcome variable and fixed effects of exposure (English, accent, Spanish), continuum (/b–p/, /d–t/, /g–k/), and their interaction. Exposure was reference coded with English as the reference group. Continuum was sum coded. Random effects included a by-participant intercept, a by-participant slope for exposure, and a by-participant slope for continuum. Following model fitting, model assumptions were checked, and 6 data points were identified as regression outliers (more than 2.5
2.4.2 Slope at category boundary
Using the same 429 continua that remained after exclusion for unreasonable category boundary values, the slopes of the curve at the category boundary were calculated. This was done by taking the derivative of each logistic equation, setting
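The slope calculation described above can be worked through as follows, assuming the standard two-parameter logistic form with intercept α and coefficient β; the symbol x_b for the category boundary is introduced here for illustration and may not match the notation of the original equation.

```latex
\[
p(x) = \frac{1}{1 + e^{-(\alpha + \beta x)}}, \qquad
\frac{dp}{dx} = \beta \, p(x)\,\bigl(1 - p(x)\bigr)
\]
\[
x_b = -\frac{\alpha}{\beta} \;\Rightarrow\; p(x_b) = \tfrac{1}{2}, \qquad
\left.\frac{dp}{dx}\right|_{x = x_b} = \frac{\beta}{4}
\]
```

Under this assumption, the slope at the boundary depends only on β: a smaller β gives a shallower curve, which corresponds to weaker differentiation between the two categories (criteria relaxing).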
2.4.3 Categorization at extremes
Accuracy at the extremes of the VOT continua was examined by checking the choice of the dominant category at the two extremes of the continuum—voiced stops for the first two steps and voiceless stops for the last two steps. The data were not aggregated by participant for this analysis, and the outcome variable was binomial, coded 1 for dominant and 0 for non-dominant category. The full dataset with all 57 participants was used for this analysis. A logistic mixed effect model was created with the binary outcome variable, fixed effects of exposure, continuum, and voicing [voiced stop (Steps 1 and 2), voiceless stop (Steps 5 and 6)], and all higher order interactions between the three variables. Voicing was deviation coded (−0.5, 0.5) and other variables were coded the same as the previous models. The full random structure did not converge, and instead, the final model included a random by-participant intercept. Model comparisons were used to determine significance of predictors. A check of model assumptions did not reveal any problems. Means are reported as proportions, and comparisons as odds ratios.
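To make the relation between proportions and odds ratios concrete, the following sketch converts two proportions into an odds ratio. It is illustrative only: the proportions are hypothetical, the function names are ours, and the odds ratios reported for the study come from the fitted mixed model rather than from raw proportions like these.

```python
def odds(p):
    """Convert a proportion strictly between 0 and 1 into odds."""
    if not 0.0 < p < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return p / (1.0 - p)


def odds_ratio(p_a, p_b):
    """Odds of the dominant-category choice in condition A relative to B."""
    return odds(p_a) / odds(p_b)


# Hypothetical proportions of dominant-category responses in two conditions.
ratio = odds_ratio(0.90, 0.80)
print(round(ratio, 2))  # 2.25
```

An odds ratio above 1 here means that the dominant category was chosen more consistently in the first condition; a value of 1 would indicate no difference.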
3 Results
3.1 Category boundary shifting
See Figure 2 for a depiction of identification curves. There was a significant effect of exposure,

Raw performance for categorization of stops on VOT continua by exposure type.
Category Boundaries (VOT in ms) by Exposure Condition and Place of Articulation. Mean (
3.2 Slope at category boundary
There was a significant main effect of exposure,
Slope at Category Boundary by Exposure Condition and Place of Articulation. Mean (
3.3 Categorization at extremes
There was a significant effect of exposure, χ2 (2) = 13.9,
Proportion Choice of Dominant Categories at Extremes of VOT Continua by Exposure, and Place of Articulation. Means (
3.4 Block order effects
Since all participants received exposure to the native English speaker first, post hoc analyses examined how order of presentation may have driven results due to fatigue, acclimation to the task, or the test speaker’s voice. Full analyses and visualizations are available in supplementary analyses on OSF. First, the English exposure block was examined on its own. To examine if children adapted to the test speaker’s voice, performance during the first 45 versus the last 45 trials of testing following English exposure was examined. Category boundary and slope did not significantly differ between the first versus the last trials. This effect also did not significantly interact with place of articulation for category boundary and slope. Therefore, the lower category boundary found after native English exposure is likely
Another post hoc analysis examined block effects in the counterbalanced second and third blocks. The variables order (2nd or 3rd) and non-English exposure conditions (Spanish or Accent) were used. One set of models included the three-way interaction among order, exposure, and continuum with the same random effects as in the analyses for boundary and slope above. Another set of models did not include the effect of continuum to simplify the model. In all models run, the effect of order and its interaction with other effects were not significant.
4 Discussion
The effects of exposure to native Spanish, native English, and Spanish-accented English speech on school-aged Spanish-English bilingual children’s categorization of English voiced and voiceless stops were examined in this study. Bilingual children presented with different patterns of adaptation to the different types of speech. Contrary to our hypothesis, following exposure to Spanish-accented English speech, the category boundary between voiced and voiceless stops when processing English words was shifted to a higher, more English-like category boundary as compared with the boundary following native English exposure. Categorization at the extremes indicated that some criteria relaxation may have been taking place following accented exposure, but only for velar stops. Following native Spanish exposure, children shifted to a more English-like category boundary and also showed criteria relaxation.
The category boundary shift after exposure to the accented speaker was in a more English-like direction, contrary to our hypothesis. Rather than capturing accommodation to accented and Spanish speech, it appears that we captured the effects of returning to processing native English after exposure to speech variability. Our test speaker was a fourth, native-English speaker who differed from all of the exposure story speakers. This design allowed a strong direct comparison between exposure conditions, but hindered us from knowing how children categorized sounds
Criteria relaxation, which was found following exposure to Spanish speech, aligns with our predictions that listeners would be less stringent in their acceptance of stops to each category after exposure to Spanish phonology if perceptual tuning is bottom up. However, if this was the case, we would expect exposure to
Our experiment provides some hints as to why accented speech comprehension does not reach adult-like levels until well into adolescence (Bent, 2018). Children may not yet be applying the mechanisms necessary to adapt to accented speech, despite having the ability to do so. Zheng and Samuel (2020) found more evidence for criteria relaxation than boundary shifting as a mechanism for natural accent adaptation in adults. Examination of performance in the accented condition suggests that criteria relaxation may not yet be strongly used by children. However, we have evidence that children were
Our results show that children are capable of using the perceptual retuning mechanisms in their L2 when encountering speech variability. Past research has found that children are capable of retuning categories to accommodate artificially induced phonological deviations in their native language (e.g., [?] instead of [s] as in McQueen et al., 2012) and a separate line of research has shown that children in the school age range are capable of accommodating accented speech broadly (e.g., Bent & Atagi, 2017). Our findings indicate that similar mechanisms may be used by children, though not fully proficiently, when processing speech variability in a second language. Even in adults, perceptual retuning in an L2 has been shown to be possible (Cooper & Bradlow, 2018; Reinisch et al., 2013), but not consistent (Cutler et al., 2018; Llanos & Francis, 2017). The structure of the phonologies of the two languages being examined affects performance in adults and this point is relevant to children as well. In a study with English and Mandarin listeners, Cutler et al. (2018) found perceptual retuning of English following English exposure in native English listeners, but not in L2 English listeners. Cutler and colleagues (2018) posit that the skill of perceptual retuning may not occur in an L2 if the L2 (English) has a larger phonemic inventory, or lacks critical cues to a contrast as compared with the L1 (Mandarin). In our study, L2 English had both a larger phonemic inventory than Spanish and lacked the critical prevoicing cue specific to the contrast we targeted. This lack of symmetry may make it more difficult to apply perceptual learning in L2 English, but would not preclude it from being captured in the other direction—if we had tested adaptation in Spanish with L1 English bilinguals.
Our findings indicate that while effects of exposure to native English differed from the other exposure types, effects of exposure to accented and Spanish speech never significantly differed from each other. This suggests that lexically guided retuning did not play a strong role in driving adaptation. When children were exposed to Spanish-accented speech, the Spanish phonemes were embedded in English words, and this was expected to aid children in shifting their perceptual boundaries. When they were exposed to Spanish speech, Spanish phonemes were embedded in words, but as the words were Spanish, the lexical context was not expected to drive adaptation in English. Yet, as performance in the accent and native Spanish conditions did not differ, it is likely that lexically guided retuning was not at play. Rather, differences between the English condition and the other two conditions may have been due to exposure to different distributions of phonemes across the VOT cue. While lexically guided retuning is often construed as a top-down process, it has also been suggested that retuning can happen as a bottom-up process simply due to statistical learning of the phonetic distribution (Idemaru & Holt, 2014; Schertz et al., 2016). If children in this experiment adjusted their perception based simply on phoneme distributions, then similar performance in the Spanish and Spanish-accented English conditions would be expected, as the distributions of voiced and voiceless stops along the cue of VOT were almost the same in these two conditions. Therefore, bottom-up learning could be the mechanism driving differences between the English condition and the other two conditions in our experiment. This explanation does require the presupposition that distributional learning is language-independent, that is, that distributional learning of Spanish can affect perception in English. This idea has received some support in the bilingualism literature, where the perceptual space is increasingly considered to be nonindependent in children (Persici et al., 2019) and adults (e.g., Marian & Spivey, 2003).
We must take care when interpreting the findings of this study because always presenting native English as the first exposure condition may have affected the results. Although post hoc analyses provided no evidence of this, future studies would need to confirm that these results hold with a different order of testing. A further point to consider is that we manipulated the VOT for each condition to be within an absolute range, which may have led to unnatural productions. The accented speaker's VOTs in particular required the most manipulation: prevoicing was added to some items that were not prevoiced, and the VOT of some voiceless stops was shortened. The speaker naturally produced VOTs that were within the range of typical Spanish-English bilinguals, but they were not as extreme as the ones associated with strongly accented speech. This speaker was chosen because they had a perceptible accent and yet were highly intelligible. We aimed to manipulate only the phonetic and phonological elements of accented speech and not additionally introduce differences in how reliably children might perceive the speaker’s speech. The outcome of this choice is that the manipulated VOTs may have seemed less natural. This may have contributed to the unexpected direction of the shift in category boundary after accented speech exposure. Examining performance after exposure to different strengths of accents, including those with VOT values naturally more similar to Spanish and those naturally more similar to English, would also be an informative follow-up experiment. The manipulation to absolute ranges also did not take into account how speech rate may affect VOT (e.g., Magloire & Green, 1999). The English speaker naturally read stories faster than the other two speakers, which could lead to her VOTs being perceived as exaggeratedly long. However, this is still in line with the manipulation to have the English speaker have longer VOTs than the other two speakers. Altogether, although we maintained control of VOT values themselves, some of the natural variability present in the different speakers was not fully accounted for. Follow-up studies with other speakers would help confirm whether effects are voice-specific or generalize to other speakers. In addition, as we focused on only a single cue, we did not capture adaptation to other cues or reweighting of cue reliance. Future studies should examine more than a single cue to form the full picture of how children adapt to accented speech.
5 Conclusion
We have demonstrated that comprehension of an L2 in bilingual children is affected by the natural language environments to which bilinguals are exposed in their day-to-day life. School-aged bilingual children made subtle changes to phoneme classification in English in response to the 3-minute language exposure they had encountered. When exposed to Spanish-accented English speech, they were more likely to shift their category boundary to be more English-like than after exposure to native English speech. When exposed to native Spanish speech, they shifted to a smaller extent and relaxed category boundaries as compared with exposure to native English speech. We have demonstrated that phonological shifting and relaxing are both used by bilingual children to accommodate different types of language exposure, and that bottom-up retuning broadly explains our results better than lexically guided retuning. Future work would need to consider how these mechanisms are applied to a variety of phonemic contrasts during accent processing to determine whether relaxation and boundary shifting are generalized mechanisms or are contrast-specific.
Footnotes
Acknowledgements
The authors thank all the children and caregivers who participated in the study, and the members of the Language Acquisition and Bilingualism Lab and Centre for Child Language Research for assistance in data collection, coding, and manuscript review. They especially thank Tania Zamuner for helpful comments on the manuscript.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Institutes of Health National Institute on Deafness and Other Communication Disorders (grant nos. R01 DC016015 and P30 HD003352) and the National Science Foundation (grant no. BCS-1749378).
Supplemental material
Supplemental material for this article is available online.
