Abstract
This longitudinal study investigates how auditory acuity and the amount of second language (L2) input lead learners to shift L2 cue-weighting strategies towards native norms in an immersion context. Thirty Mandarin-speaking students were tested on Spanish lexical stress perception upon arrival (T1) and before leaving Spain after one academic year (T2). Due to cross-linguistic influence, the learners placed more weight on pitch for stress perception, unlike Spanish natives, who relied more on duration. However, at T2, the learners upweighted the duration cue, suggesting a potential L2 cue-weighting shift. At the individual level, more L2 input helped learners with lower pitch acuity shift towards a native-like cue-weighting strategy, whereas those with higher pitch acuity showed the opposite pattern. The results highlight the persistent influence of L1 prosodic transfer and the interaction of cognitive and experiential factors in reshaping L2 perceptual categorization, which underscores the importance of individual differences in L2 speech acquisition in naturalistic settings.
1. Introduction
People use multiple acoustic cues to categorize linguistic units (Hillenbrand et al., 1995; Holt and Lotto, 2006), but listeners tend to weight one cue more heavily than the rest in speech perception. For instance, the American English tense–lax vowel pair /i:–ɪ/ contrasts in both duration and formant, but native listeners predominantly use formant cues to distinguish the two vowels (Hillenbrand et al., 1995; Kondaurova and Francis, 2008). Similarly, English /r/ and /l/ differ acoustically in the onset of the second and third formants (F2 and F3), but F3 outweighs F2 in L1 English listeners’ identification of the /r–l/ contrast (Iverson and Kuhl, 1994; Iverson et al., 2003). More importantly, ‘cue-weighting’, which refers to the relative use of various acoustic cues in language perception, varies across languages. For instance, lexical stress can be signalled by pitch, duration, intensity, and formants in various languages (Fear et al., 1995; Jasmin et al., 2023). However, Spanish listeners place more weight on duration to perceive Spanish lexical stress (Li and Xi, 2024b), whereas English listeners rely on formant more than pitch to distinguish English lexical stress (Wang et al., 2024).
Crosslinguistic differences in cue-weighting can pose challenges for second language (L2) learners, as they may transfer their first language (L1) cue-weighting patterns to the L2 (Francis and Nusbaum, 2002; Francis et al., 2000; Kondaurova and Francis, 2008; Zhang and Francis, 2010). From the longitudinal perspective, learners gradually shift their cue-weighting strategies towards the native norms of the target L2 (Yazawa et al., 2020). Therefore, understanding how learners make the shift is important for L2 perceptual learning. Moreover, since cue-weighting strategies can vary among individuals in both L1 and L2 (Chandrasekaran et al., 2010; Kong, 2019; Kong and Kang, 2023), it is important to investigate how individual differences account for the L2 cue-weighting change in the long run. This study focuses on L1 Mandarin listeners to assess how they shift their L2 cue-weighting for Spanish lexical stress perception in a study abroad (SA) context over one academic year.
1.1. Cue-weighting in L2 perceptual learning and the role of amount of L2 input
Cue-weighting theory predicts that L2 learners may differ from L1 listeners in their cue-weighting strategies for speech perception (Francis and Nusbaum, 2002; Holt and Lotto, 2006; Kondaurova and Francis, 2008; Zhang and Francis, 2010). For instance, some L2 learners of English (e.g. L1 Japanese speakers) rely on duration rather than formant to perceive the tense–lax vowel contrasts due to L1 transfer (Yazawa et al., 2020). Even when learners’ L1 has no vowel length contrast (e.g. Spanish, Russian), they may still place more weight on duration (Cebrian, 2006; Kim et al., 2018; Kondaurova and Francis, 2008) because processing durational cues may be cognitively less demanding than forming new categories relying on formant (Bohn, 1995; Escudero and Boersma, 2004). To perceive English lexical stress, both L1 Mandarin and L1 English listeners use formant and pitch information, but Mandarin listeners rely more on pitch (Zhang and Francis, 2010), while English listeners rely more on formant (Braun et al., 2011; Cutler, 1986; Cutler et al., 2007; Fear et al., 1995; Small et al., 1988; Wang et al., 2024). Studies on other language pairs have reached similar conclusions, including L1-English–L2-Spanish and L1-Dutch–L2-English, among others (Ortega-Llebaria et al., 2013; Ortín and Simonet, 2022, 2023; Romanelli and Menegotto, 2015; Romanelli et al., 2015; Sagarra et al., 2024; Tremblay et al., 2021). These studies suggest the existence of a potential hierarchy of acoustic cues in speech perception: acoustic cues that are more salient in the L1 or cognitively less demanding may carry more weight for L2 learners than for L1 listeners of the same language.
L2 perceptual learning involves gradually shifting the cue-weighting strategies from the learners’ L1 patterns towards the native norms of the target L2, which we term ‘cue-weighting shift’. As perceptual cue-weighting seems to be stable across listeners and contexts, adult L2 learners often face challenges in cue-weighting shift. For instance, Korean lenis–fortis stops are primarily distinguished by pitch with little difference in voice onset time (VOT) (Holliday, 2014; Martínez García and Holliday, 2019). Lee et al. (2022) tested the development of French-speaking beginners’ perception of L2 Korean stops in a classroom learning setting. After one academic year, the learners showed limited cue-weighting shift from VOT to pitch to distinguish Korean lenis–fortis stops. The results suggest that cue-weighting does not shift easily with classroom instruction alone; more immersive exposure or explicit training is needed.
Nevertheless, cue-weighting shift can occur as learners’ L2 proficiency and L2 experience increase (Gilbert et al., 2026; Kong and Kang, 2023; Yu, 2023). For instance, L1 Dutch listeners with higher English proficiency rely more on formants to perceive L2 English lexical stress than less proficient listeners, a pattern that is considered more native-like (Tremblay et al., 2021). Similarly, L1 Mandarin listeners with long-term residence (3+ years) in English-speaking countries rely more on duration to perceive English phrase boundary contrasts than learners with short-term residence (less than 1 year), suggesting a potential trend for cue-weighting shift resulting from immersion experience (Petrova et al., 2023). In these cross-sectional studies, although the learners still overweigh the acoustic cues transferred from their L1, sufficient L2 experience can modify their cue-weighting strategy, especially in the naturalistic learning context of residence abroad.
However, what matters for cue-weighting shift seems to be the amount of L2 input rather than the length of residence, as the two do not necessarily entail each other (Flege and Bohn, 2021). Yazawa et al. (2017) used simulated data to model the relationship between L2 input and cue-weighting shift with a virtual learner. Their computational model predicts that, although the virtual learner begins a cue-weighting shift within 10 months, this requires a large amount of L2 input: the model assumes 1,000 exposures to the target sounds per month. However, evidence from real learners shows that long-term residence does not entirely shift the learners’ L2 perceptual cue-weighting (Cebrian, 2006; Ingvalson et al., 2011). Recent studies demonstrate that the amount of L2 input positively affects L2 speech learning in an immersion context (Sun et al., 2024; Turner, 2024). Therefore, it is plausible that for L2 cue-weighting shift to occur, a sufficient amount of L2 input is necessary. In other words, even with the same length of residence, learners with more L2 input are more likely to shift their cue-weighting toward native-like norms than those with less input.
Notably, re-weighting the acoustic cues does not imply an absolute trade-off where upweighting one cue automatically downweights the other. Instead, learners can enhance sensitivity to multiple cues and establish a cue-weighting strategy that falls between their L1 pattern and the native norms of the L2. In Petrova et al.’s (2023) study, for instance, although the long-term residence group placed more weight on duration for phrase boundary contrast than the short-term group, they also did so with pitch. This suggests that even though a potential cue-weighting shift can be expected after long-term immersion, extensive L2 exposure may also upweight the acoustic cues that learners transfer from their L1.
Finally, although self-reports cannot avoid subjectivity, language questionnaires remain the most practical and widely used tool for indexing L2 input in longitudinal studies (Turner, 2024). Large-scale recording or ecological momentary assessment methods may pose challenges from logistical and analytical perspectives, especially when participants are dispersed across various institutions during SA. End-of-SA language history questionnaires can be uniformly administered and reliably coded, and these have yielded consistent results across studies. For instance, with questionnaires, previous studies have discovered the significant predictive role of self-estimated L2 use in the improvement of L2 pronunciation after study abroad, including media exposure, hours per day speaking with native speakers, and so on (Díaz-Campos, 2004; Muñoz and Llanes, 2014; Stevens, 2011; Sun et al., 2024; Turner, 2024). It seems that questionnaires can balance feasibility and validity for examining naturalistic L2 input in an SA context. Therefore, in the current study, we used a questionnaire to survey the participants’ amount of L2 input during SA.
To summarize, learners’ L1 cue-weighting strongly influences L2 perceptual categorization. Therefore, the cue-weighting shift in the L2 may need a long time and, crucially, a substantial amount of L2 input. This calls for longitudinal studies in an immersion context, which is the first motivation of the current study.
1.2. The role of auditory acuity in L2 speech perception
Beyond the influence of learners’ L1 background, individual differences in auditory acuity also play an important role in L2 speech learning. Auditory acuity can be assessed using either linguistic or non-linguistic stimuli, and these two types of abilities affect L2 learning differently. A well-studied example concerns pitch processing and the learning of L2 lexical tones. Linguistic pitch processing abilities, such as the identification or discrimination of Mandarin lexical tones, can predict success in L2 tonal word learning (Bowles et al., 2016; Chui and Qin, 2024; Cooper and Wang, 2012; Qin et al., 2021; Wong and Perrachione, 2007). By contrast, non-linguistic pitch processing abilities, measured with sine-wave or pure-tone stimuli, can better predict learners’ ability to generalize tonal learning to new contexts (Bowles et al., 2016).
Building on this distinction, recent research has increasingly focused on domain-general auditory acuity assessed with non-linguistic stimuli. Domain-general auditory acuity refers to the ability to precisely perceive basic acoustic dimensions such as pitch, formant, and duration, which is crucial for making effective use of L2 input during learning (Saito, 2023). Higher domain-general auditory acuity is associated with greater success in various aspects of L2 learning, including lexical proficiency (Saito et al., 2022c), speech perception (Kachlicka et al., 2019; Saito et al., 2022a), and production (Li et al., 2026; Saito et al., 2020a). Importantly, these effects are more consistently observed in immersion contexts than in classroom settings, possibly because classroom environments provide insufficient auditory input (Saito et al., 2021).
At the same time, domain-specific auditory acuity, particularly sensitivity to acoustic cues that are critical in the target L2 features, plays a central role in speech perception and learning. Higher pitch acuity is associated with better learning outcomes in L2 lexical tones (Qin et al., 2021, 2022; Zhou and Veríssimo, 2026) and intonation (Zheng et al., 2022). Similarly, duration acuity predicts the perception of L2 voicing contrasts based on VOT (Liu, 2022), the learning of L2 vowel length contrasts (Kempe et al., 2015), and lexical stress (Zheng et al., 2022). Finally, formant acuity is particularly relevant for acquiring difficult L2 vowel contrasts or consonant contrasts that require fine-grained spectral discrimination abilities (Lengeris and Hazan, 2010; Saito et al., 2022b).
However, few studies have tested the role of domain-specific auditory acuity in L2 cue-weighting shift, which requires a longitudinal perspective. One can formulate different hypotheses regarding the role of auditory acuity in the cue-weighting shift. First, if a learner has accurate domain-specific auditory acuity, in pitch for instance, more L2 input will strengthen their pitch-cue reliance for L2 perceptual categorization, especially when pitch is relevant to perceiving the L2 category (e.g. lexical stress). In line with this hypothesis, a recent study found that L1 Mandarin students with more precise formant discrimination improved L2 English stress production after eight-month study abroad (Saito et al., 2020b). It might be that precise formant discrimination abilities directed the learners’ attention to formant cues in speech perception, which was later applied to speech production. However, this speculation cannot be validated without direct perceptual evidence. A second hypothesis could be that, if a learner does not have a precise discrimination ability in the critical acoustic cue, more L2 input may draw their attention to other acoustic cues. For instance, an L1 Mandarin listener with weak pitch discrimination abilities would pay less attention to pitch in L2 stress perception, which would shift their cue-weighting towards other critical cues like duration. Again, evidence is needed to validate this assumption.
Overall, to investigate L2 cue-weighting shift, we need longitudinal evidence for the role of domain-specific auditory acuity and the amount of L2 input. This research gap motivates our decision to include the measures of the two individual difference factors.
1.3. Word-level prosody and acoustic cues of Spanish and Mandarin
The present study explores the lexical stress categorization of L2 Spanish by L1 Mandarin listeners. Therefore, we briefly present here the word-level prosody and the acoustic cues of the two languages.
Spanish is a stress language with lexical stress distinguishing lexical meanings, which is mainly cued by suprasegmental correlates (Ortega-Llebaria et al., 2013). In isolated words, stressed syllables are marked by higher pitch, longer duration and greater intensity than unstressed syllables (Hualde, 2012). When a word is embedded in a phrase, duration becomes a more reliable acoustic cue than pitch for perceiving lexical stress (Torreira et al., 2014), because pitch variations primarily mark phrase-level intonation patterns rather than word-level stress. Specifically, in the sentence-final (nuclear) position, the stressed syllable is often lengthened but shows various pitch patterns depending on the pragmatic function. In the non-final (prenuclear) position, a non-accented or low-accented word may assign a low pitch to the stressed syllable, whereas the pitch peak is sometimes delayed to the post-stressed syllable (Hualde and Prieto, 2015). In this case, duration becomes a more reliable cue for lexical stress contrasts (Ortega-Llebaria and Prieto, 2011). This interplay highlights how duration anchors stress identification at the sentence level, which challenges L2 learners’ perception of Spanish lexical stress.
Mandarin is a tone language, where the lexical tone is primarily organized around pitch specifications, while duration plays a secondary but functionally relevant role in lexical stress (Duanmu, 2007). In Standard Mandarin, each syllable is assigned one of the four citation tones, which differ in pitch height and contours: a flat tone mā ‘mother’, a rising tone má ‘hemp’, a dipping tone mǎ ‘horse’, or a falling tone mà ‘to scold’. Mandarin also shows evidence of lexical stress for which duration is the main phonetic correlate (Chen and Xu, 2006; Qu, 2013; Xu, 1997). A weak syllable with a neutral tone (e.g. ma, a question particle) can only be attached to a strong syllable with one of the four citation tones. Weak syllables are shorter than strong syllables but also vary in pitch contours (Lee and Zee, 2008). Because of the high functional load of lexical tones, tone language speakers are sensitive to pitch specifications (Bidelman et al., 2013; Chandrasekaran et al., 2009; Krishnan et al., 2009; Petrova et al., 2023). At the same time, as duration plays an important role in lexical stress, Mandarin listeners are still capable of acquiring duration as an acoustic cue in the perception of L2 suprasegmental features (Petrova et al., 2023; Qin et al., 2017).
Previous research has shown that Mandarin listeners largely rely on pitch for L2 Spanish lexical stress even after years of residence in Spanish-speaking countries. In perception, L1 Mandarin listeners relied on pitch more than duration to identify Spanish lexical stress at the sentence level, in contrast to L1 Spanish listeners who rely more on duration (Li and Xi, 2024b). Similarly, in running speech, Mandarin speakers produce Spanish stressed vowels with significantly higher pitch and longer duration than unstressed vowels or syllables without pitch accent, but Spanish speakers primarily manipulate duration (Li and Xi, 2022, 2023, 2024a). Therefore, we can predict that L1 Mandarin listeners will initially place more weight on pitch for the perception of Spanish lexical stress at the beginning of their stay in Spanish-speaking environments, which is the case with our participants.
1.4. The present study
The present study is motivated by two research gaps. First, there is not much longitudinal evidence on how the amount of L2 input facilitates cue-weighting shift in L2 perceptual categorization. Second, it is not clear whether and how auditory acuity and its interaction with the amount of L2 input affect the cue-weighting shift in L2 perception. We therefore formulate the following two hypotheses.
Hypothesis 1. At the group level, Mandarin listeners will initially rely more on pitch than duration to perceive L2 Spanish lexical stress due to L1 cue-weighting transfer. However, as duration plays a significant role in Spanish lexical stress perception, learners may upweight duration after SA. At the same time, since pitch is a relevant cue as well, learners can also reinforce their pitch-cue reliance after SA. These changes will result in a more categorical perceptual pattern of lexical stress.
Hypothesis 2. If SA leads to learners’ cue-weighting shift, they will show reduced reliance on pitch relative to duration. At the individual level, we expect that auditory acuity affects L2 cue-weighting, and that effect will emerge when participants receive a sufficient amount of L2 input. Specifically, participants with higher pitch acuity will place more weight on pitch to identify L2 lexical stress with an increased amount of L2 input. By contrast, those with lower pitch acuity would downweight the pitch cue with an increased amount of L2 input, which would facilitate a cue-weighting shift towards duration. The same hypothesis is formulated for duration acuity and duration cue-weighting.
2. Methods
2.1. Participants
This study was approved by the Norwegian Agency for Shared Services in Education and Research (SIKT). We recruited two groups of participants: a Chinese student group and a Spanish native group. All participants signed a consent form which allowed the researchers to collect and analyze their personal data. No one reported any history of speech or hearing impairments.
The Chinese students were 30 L1 Mandarin female learners of Spanish (Mage = 22.23 years, SD = 0.68). All were graduates in Spanish language and had received formal instruction in Spanish language for an average of 4.17 years (SD = 0.46) in China. None of the participants had prior SA experience. They moved to Spain to pursue a master’s degree, which was taught in Spanish. All the participants had taken at least one standardized Spanish proficiency test, Diploma de Español como Lengua Extranjera (DELE) and/or Servicio Internacional de Evaluación de la Lengua Española (SIELE), according to which, their Spanish proficiency ranged from B1 (intermediate) to C1 (advanced).
The Spanish natives were 19 female speakers of Peninsular Spanish (Mage = 21.11, SD = 3.05). They all grew up in Spain and reported using Spanish for their daily communication needs. None reported any study or living abroad experience except for short trips to foreign countries.
2.2. Materials and procedure
This study was part of a large SA project that consisted of a variety of tasks over one academic year. Considering the research questions at hand, we will focus on reporting the Spanish lexical stress perception task, the auditory acuity test, and the L2 use questionnaire.
2.2.1. Spanish lexical stress perception task
The auditory stimuli of the Spanish lexical stress perception task were recorded by a female L1 speaker of Castilian Spanish (age 26 years) in a soundproof room using a Shure SM35 headset microphone and a Zoom H4n Pro portable recorder. The speaker produced the words paso and pasó after the carrier phrase Por la plaza at a normal speech rate. The sentences differed in meaning due to the stress patterns of the words: the strong–weak (SW) pattern paso, ‘I pass through the square’, and the weak–strong (WS) pattern pasó, ‘He passed through the square’.
We used Praat (Boersma and Weenink, 2017) to manipulate the auditory stimuli as follows. First, the intensity of the two syllables pa and so was adjusted to match the carrier sentence’s mean intensity of 64 dB, which remained unchanged in subsequent steps. Next, we created 49 auditory stimuli that varied in seven steps from word-initial stress paso to word-final stress pasó along two dimensions: vowel pitch and duration. For pitch cues, based on previous research indicating a mean fundamental frequency (F0) difference of 12 Hz between stressed and unstressed vowels in female Spanish native speakers (Li and Xi, 2022), we manipulated the mean F0 difference between the vowels in pa and so in seven steps: −18, −12, −6, 0, 6, 12, and 18 Hz, with a 12 Hz difference representing the medium-level contrast. From each of the seven steps along the F0, a seven-step continuum of duration was created. For duration cues, female Spanish speakers produced stressed vowels 1.5 times longer than unstressed vowels (Li and Xi, 2022). The mean vowel duration of the original paso and pasó produced by our model speaker was approximately 100 ms. We thus manipulated the vowel duration difference between pa and so in seven steps: −75, −50, −25, 0, 25, 50, and 75 ms, with a duration ratio of 1.5 (150 ms : 100 ms) being the medium level. The manipulation gave a 7 × 7 matrix of pitch and duration cues for the auditory stimuli. Finally, the 49 stimuli were appended after the carrier sentence to create the experimental stimuli.
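The 7 × 7 design described above can be enumerated in a few lines. The following Python sketch is purely illustrative: the variable and key names are hypothetical, and the acoustic resynthesis itself, which was carried out in Praat, is not reproduced here.

```python
from itertools import product

# Step values as stated in the text: differences between the vowels of
# "pa" and "so" in mean F0 (Hz) and in duration (ms).
PITCH_STEPS_HZ = [-18, -12, -6, 0, 6, 12, 18]
DURATION_STEPS_MS = [-75, -50, -25, 0, 25, 50, 75]

# One stimulus per (pitch difference, duration difference) combination.
# The dictionary keys are hypothetical labels chosen for this sketch.
stimuli = [
    {"pitch_diff_hz": p, "duration_diff_ms": d}
    for p, d in product(PITCH_STEPS_HZ, DURATION_STEPS_MS)
]

print(len(stimuli))  # 49
```

Crossing the two continua fully, rather than varying one cue at a time, is what allows the two cues to be pitted against each other in the later regression analyses.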
The stress perception task was conducted using Praat. Participants first received instructions in their L1 to ensure they understood how to complete the task. They then completed five practice trials and had the opportunity to raise any questions with the experimenter. In the main task, participants listened to the 49 stimuli presented in a randomized order within a single testing block. Each stimulus was presented four times, resulting in a total of 196 trials per participant. The entire task took approximately 10 minutes.
Chinese students completed the lexical stress perception task twice: once before SA (at T1) and once after (at T2), whereas Spanish natives completed the task only once. During each trial, participants listened to a sentence stimulus while seeing at the top of the screen an incomplete sentence without the subject pronoun ‘Por la plaza, paso/pasó ______’. They had to select the subject pronoun to complete this sentence based on their perception of the lexical stress pattern of the verb. A yo response indicated that the participant perceived an SW stress pattern of the verb, whereas an él response indicated that a WS pattern was perceived. Responses were recorded via key press, with participants pressing either the ‘Z’ or ‘M’ key. To minimize potential hand preference bias, the key-to-response mapping was counterbalanced across participants. Once a response key was pressed, the next trial began automatically.
2.2.2. Auditory acuity test batteries on pitch and duration at T1
We selected two AXB discrimination tasks from the auditory acuity test batteries (Mora-Plaza et al., 2022) to measure individuals’ perceptual sensitivity to pitch and duration. Only the Chinese students completed the two tasks (at T1), which followed the same structure. On each trial, participants heard three auditory stimuli and indicated whether the second stimulus differed from the first or the third by selecting the number ‘1’ or ‘3’ displayed on the screen. Both the pitch and duration discrimination tasks comprised a continuum of 100 synthesized stimuli. In the pitch discrimination task, the F0 of the 100 auditory stimuli ranged from 330.3 Hz to 360 Hz. In the duration discrimination task, the duration continuum ranged from 0.25 s to 0.5 s. The 100 stimuli were organized by difficulty, with the task starting at stimulus level 50. Task difficulty was adjusted dynamically based on participants’ responses: three consecutive correct responses increased the difficulty by 10 levels, while a single incorrect response decreased it by 10 levels.
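The adaptive rule described above (three correct responses in a row make the task harder, one error makes it easier) can be sketched as follows. This is a minimal illustration, not the software used in the test battery. Several details are assumptions made only for this sketch: that a higher level index means a harder trial, that levels are clamped to the range 1 to 100, and that the run of correct responses is reset after each step change. The stopping rule and the conversion to threshold scores are not modelled.

```python
def update_level(level, correct_streak, is_correct, step=10, lowest=1, highest=100):
    """Return the new (level, correct_streak) after one response.

    Assumption for this sketch: a higher level index means a harder trial.
    """
    if is_correct:
        correct_streak += 1
        if correct_streak == 3:
            # Three consecutive correct responses: make the task harder.
            level = min(highest, level + step)
            correct_streak = 0
    else:
        # A single incorrect response: make the task easier.
        level = max(lowest, level - step)
        correct_streak = 0
    return level, correct_streak


def run_staircase(responses, start_level=50):
    """Apply the rule to a sequence of correct/incorrect responses."""
    level, streak = start_level, 0
    track = [level]
    for is_correct in responses:
        level, streak = update_level(level, streak, is_correct)
        track.append(level)
    return track


# Hypothetical response sequence, for illustration only.
print(run_staircase([True, True, True, False, True, True, True]))
# [50, 50, 50, 60, 50, 50, 50, 60]
```

An asymmetric rule of this kind makes the level drift towards the point where a listener is correct well above chance, which is why the final level can be read as a perceptual threshold.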
The two tasks had a combined duration of 10 minutes. The system generated the results for each participant, which were later converted to Pitch Discrimination Score (in Hertz) and Duration Discrimination Score (in milliseconds) following Zheng et al. (2022). The two discrimination scores indicate an individual’s pitch and duration perceptual threshold, with a low score meaning good auditory acuity. For more details of the stimuli creation, refer to Kachlicka et al. (2019).
2.2.3. Questionnaires
We designed two questionnaires adapted from the language history questionnaire (Li et al., 2020) and the L2 English experience questionnaire (Sun et al., 2024). Only Chinese students were required to complete the questionnaires, which were therefore written in Chinese to ensure understanding. First, the linguistic background questionnaire used at T1 (see Appendix A) collected participants’ demographic information and linguistic experience, including current age, age of acquisition of Spanish, years of learning Spanish, and proficiency of all learned languages. Then, the language use questionnaire used at T2 (see Appendix B) asked participants to estimate the number of hours per day that they spent on various listening and speaking activities in Mandarin, English, Spanish, and Catalan during SA. Catalan was included because the participants resided in a Spanish-Catalan bilingual city. For this study, we only analyze the data of listening activities, which covered language input through music (e.g. songs), auditory materials (e.g. radio), audiovisual materials (e.g. TV), academic content (e.g. lectures), and non-academic content (e.g. casual conversations).
2.3. Data coding and statistical analyses
Three Chinese students dropped out before T2, so the final dataset comprised the following. First, the lexical stress perception task yielded a total of 14,896 responses [(30 Learners at T1 + 27 Learners at T2 + 19 Spanish Natives) × 196 trials]. Second, the auditory acuity test at T1 yielded 30 Pitch Discrimination scores and 30 Duration Discrimination scores. Third, 27 Chinese students filled out the L2 use questionnaire at T2.
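The response total reported above follows directly from the design, as this short Python check shows. The dictionary keys are hypothetical labels; the counts are those given in the text.

```python
# 49 stimuli, each presented four times, per task administration.
TRIALS_PER_ADMINISTRATION = 49 * 4

# Number of completed task administrations, as reported in the text.
administrations = {"learners_T1": 30, "learners_T2": 27, "spanish_natives": 19}

total_responses = sum(administrations.values()) * TRIALS_PER_ADMINISTRATION
print(total_responses)  # 14896
```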
2.3.1. Coding the Spanish lexical stress perception task
We binary-coded the participants’ responses in the Spanish lexical stress perception task: 1 represented a yo response, which indicated an SW stress pattern, while 0 represented an él response, which indicated a WS stress pattern.
2.3.2. Measure for the cue-weighting shift
We calculated the cue-weighting score for Spanish lexical stress perception of each participant, following Yazawa et al. (2020). We applied a logistic regression for every participant, with the binary responses as the dependent variable, and the z-scored pitch and duration steps as predictors. The resulting regression coefficients (β) for pitch and duration reflected the relative contribution of each cue to stress perception. The pitch cue-weighting (PQ) and duration cue-weighting (DQ) were calculated using Formulas (1) and (2), respectively. The sum of PQ and DQ for each participant equals 1. For instance, if one’s PQ is 0.7, their DQ will be 0.3. In other words, analyzing either measure will yield the same results. Therefore, we opted for PQ as the cue-weighting measure for the subsequent calculation.
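To make the relationship between the two measures concrete, the sketch below normalizes a pair of regression coefficients into weights that sum to 1. Formulas (1) and (2) are not reproduced in this text, so the normalization used here (each absolute coefficient divided by the sum of the two absolute coefficients) is an assumption that is merely consistent with the description above; the function name and the example coefficients are hypothetical.

```python
def cue_weights(beta_pitch, beta_duration):
    """Turn two logistic-regression coefficients into (PQ, DQ), with PQ + DQ = 1.

    Assumed normalization: |beta| of each cue over the sum of both |beta| values.
    """
    total = abs(beta_pitch) + abs(beta_duration)
    if total == 0:
        raise ValueError("both coefficients are zero; the weights are undefined")
    pq = abs(beta_pitch) / total
    dq = 1.0 - pq
    return pq, dq


# Hypothetical coefficients for one listener, for illustration only.
pq, dq = cue_weights(beta_pitch=2.1, beta_duration=0.9)
print(round(pq, 2), round(dq, 2))  # 0.7 0.3
```

Because the predictors are z-scored before fitting, the two coefficients are on a comparable scale, which is what makes a ratio of this kind interpretable.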
To assess the extent to which the Chinese students shifted their cue-weighting strategies towards the native norms, we calculated the change in the difference between each learner’s PQ and the Spanish natives’ average PQ from T1 to T2 (Formula 3). A positive score means that the Chinese student’s cue-weighting became closer to that of the Spanish natives at T2 as compared to T1, which indicates a cue-weighting shift towards the native norm. A negative score means the opposite and zero means no shift. We labelled this score as ‘Cue-Weighting Shift Score’.
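The logic of the Cue-Weighting Shift Score can be sketched in the same way. Formula (3) is not reproduced in this text, so the version below, which compares absolute distances from the native mean at T1 and T2, is an assumption chosen to match the verbal description (positive means closer to the native norm at T2). All numeric values are hypothetical.

```python
def cue_weighting_shift(pq_t1, pq_t2, pq_native_mean):
    """Change in a learner's distance from the native mean PQ between T1 and T2.

    Assumed form: |PQ_T1 - native mean| - |PQ_T2 - native mean|.
    Positive: closer to the native norm at T2; negative: further away; zero: no shift.
    """
    distance_t1 = abs(pq_t1 - pq_native_mean)
    distance_t2 = abs(pq_t2 - pq_native_mean)
    return distance_t1 - distance_t2


# Hypothetical learner who relies less on pitch at T2 than at T1.
shift = cue_weighting_shift(pq_t1=0.70, pq_t2=0.55, pq_native_mean=0.40)
print(round(shift, 2))  # 0.15
```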
2.3.3. Measure for the amount of L2 input
As this study focuses on speech perception, we only coded the Chinese students’ answers to the listening experience during SA, which reflected the amount of L2 input. For each participant, we aggregated the self-estimated number of hours per day listening to Mandarin (HL1) and Spanish (HL2) for all the activities. Following Li et al. (2020), we divided HL2 by HL1 to account for the individual biases in self-estimation. The resulting variable was labelled as ‘L2 Input Score’, which is a ratio between the amount of L2 and L1 input. See Formula (4).
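The ratio described above can be written out as a short Python sketch. The function name, the activity labels and the hour values are hypothetical; only the aggregation over listening activities and the division of HL2 by HL1 come from the text.

```python
def l2_input_score(hours_spanish, hours_mandarin):
    """Ratio of daily listening hours in Spanish (HL2) to Mandarin (HL1)."""
    hl2 = sum(hours_spanish.values())
    hl1 = sum(hours_mandarin.values())
    if hl1 == 0:
        raise ValueError("HL1 is zero; the ratio is undefined")
    return hl2 / hl1


# Hypothetical self-reported hours per day for one participant.
spanish = {"music": 1.0, "auditory": 0.5, "audiovisual": 1.5,
           "academic": 3.0, "non_academic": 2.0}
mandarin = {"music": 0.5, "auditory": 0.5, "audiovisual": 1.0,
            "academic": 0.0, "non_academic": 2.0}

print(l2_input_score(spanish, mandarin))  # 2.0
```

A score above 1 means the participant reported more Spanish than Mandarin listening. Using a ratio rather than raw hours cancels out a participant's general tendency to over- or under-estimate time spent.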
2.4. Statistical analysis
The statistical analyses were performed in R (R Core Team, 2014) using the lme4 package (Bates et al., 2015), with p-values estimated by the lmerTest package (Kuznetsova et al., 2017). The data visualization was performed using ggplot2 in tidyverse (Wickham, 2016), interactions (Long, 2019) and ggdist (Kay, 2024) packages.
To assess Hypothesis 1, we fitted two generalized linear mixed-effects models (GLMMs). The regression coefficients of a GLMM indicate the extent to which the listeners relied on a specific acoustic cue to perceive lexical stress. Larger coefficients indicate greater reliance and a crisper category boundary (Morrison, 2007). The dependent variable was the participants’ binary responses (1 = SW, 0 = WS). Model 1 assessed whether the Chinese students had changed their perceptual categorization of lexical stress from T1 to T2. The independent variables included Pitch Step, Duration Step, Session (T1 vs. T2) as well as all possible interactions.
Model 2 assessed whether the Chinese students at T2 differed from Spanish natives in the perceptual categorization of lexical stress. The independent variables included Pitch Step, Duration Step, Group (Chinese students vs. Spanish natives), and all possible interactions.
For each model, we added a by-participant random intercept and by-participant random slopes for Pitch Step and Duration Step. For Model 1, as the participants were tested twice (T1 and T2), we also included Session as a by-participant random slope.
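To illustrate why a larger logistic coefficient corresponds to a crisper category boundary, the sketch below evaluates a plain logistic function at two slopes. It is not the fitted GLMM: random effects are ignored, the slope values and the boundary location are hypothetical, and the 7-step continuum is an assumption based on reading the 49 stimuli as a 7 × 7 grid.

```python
import math


def p_sw(step: float, slope: float, boundary: float = 4.0) -> float:
    """Logistic probability of an SW response at a given continuum step."""
    return 1.0 / (1.0 + math.exp(-slope * (step - boundary)))


# Assumption: a 7-step continuum per cue (49 stimuli read as a 7 x 7 grid).
steps = range(1, 8)
shallow = [round(p_sw(s, slope=0.5), 2) for s in steps]  # weaker cue reliance
steep = [round(p_sw(s, slope=2.0), 2) for s in steps]    # stronger cue reliance

print("slope 0.5:", shallow)
print("slope 2.0:", steep)
```

With the steeper slope, predicted SW proportions sit closer to 0 and 1 at the continuum endpoints and change more abruptly around the boundary, which is the pattern described as more categorical perception.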
To assess Hypothesis 2, we built a linear model (Model 3) with Cue-Weighting Shift Score being the dependent variable. The independent variables included the Pitch Discrimination Score, Duration Discrimination Score, L2 Input Score, and all possible interactions. All continuous predictors were z-score transformed prior to analysis.
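As a minimal sketch of the predictor preparation for Model 3, the code below z-scores two continuous predictors and forms their product as the interaction term. The use of the sample standard deviation (n − 1, as in R's `scale()`) is an assumption, and the scores are invented for illustration.

```python
from statistics import mean, stdev
from typing import List, Sequence


def z_scores(values: Sequence[float]) -> List[float]:
    """Center at the mean and divide by the sample standard deviation (n - 1)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


def interaction_term(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Element-wise product of two (already standardized) predictors."""
    return [a * b for a, b in zip(x, y)]


# Hypothetical scores for three learners (not data from the study).
pitch_z = z_scores([10.0, 20.0, 30.0])
input_z = z_scores([0.5, 1.0, 1.5])
print(interaction_term(pitch_z, input_z))
```

Standardizing before forming the product means each main effect is interpreted at the mean of the other predictor.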
3. Results
3.1. The change of lexical stress categorization over time
Figure 1 plots the percentage of SW responses given by the participants to each of the 49 synthesized stimuli. In what follows, we report on the statistical results of the two comparisons outlined in Section 2.

Figure 1. Proportion of strong–weak (SW) responses to the synthesized stimuli ‘PASO’ in the Spanish lexical stress perception test, split into Spanish natives, Chinese students before study abroad (SA) (T1) and after SA (T2). /a/ is the vowel of the first syllable ‘pa’; /o/ is the vowel of the second syllable ‘so’.
3.1.1. Comparing the Chinese students’ lexical stress perception from T1 to T2
The analysis of Model 1 (Table 1) revealed the following results. First, there were significant main effects of Pitch Step and Duration Step, which means that an increase in pitch or duration differences between the vowels /a/ and /o/ resulted in a significantly higher proportion of SW (paso) responses at T1, as T1 was the reference level.
Table 1. Summary of Model 1.
Notes. Model formula: response ~ pitch step * duration step * session + (duration step + pitch step + session | participant). Session is a categorical variable with dummy coding; reference level = T1.
Second, the significant two-way interactions of Pitch Step × Session and Duration Step × Session mean that the effects of Pitch Step and Duration Step on the participants’ responses changed from T1 to T2 (Figure 2). Both interactions had positive coefficients, which means that both acoustic cues had stronger effects on the participants’ responses at T2 than at T1. In other words, at T2, participants weighted both pitch and duration more heavily for lexical stress categorization, although the increase for pitch (β = 0.86) appeared larger than that for duration (β = 0.45).
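Because Session is dummy coded with T1 as the reference level, each cue's slope at T2 equals its T1 slope plus the corresponding interaction coefficient. The sketch below makes this arithmetic explicit. The two interaction coefficients are those reported in the text; the T1 slope is a hypothetical placeholder, since the main-effect estimates appear only in Table 1.

```python
def slope_at_t2(slope_t1: float, interaction: float) -> float:
    """With T1 as the dummy-coded reference, T2 slope = T1 slope + interaction."""
    return slope_t1 + interaction


PITCH_X_SESSION = 0.86        # interaction coefficient reported in the text
DURATION_X_SESSION = 0.45     # interaction coefficient reported in the text
HYPOTHETICAL_T1_SLOPE = 1.0   # placeholder, NOT an estimate from Table 1

for cue, interaction in [("pitch", PITCH_X_SESSION), ("duration", DURATION_X_SESSION)]:
    t2 = slope_at_t2(HYPOTHETICAL_T1_SLOPE, interaction)
    print(f"{cue}: T1 slope {HYPOTHETICAL_T1_SLOPE:.2f} -> T2 slope {t2:.2f}")
```

Whatever the actual T1 slopes are, the change from T1 to T2 is given by the interaction coefficient alone, which is why the two interaction terms can be compared directly.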

Figure 2. The proportion of strong–weak (SW) responses as a function of pitch step (left panel) and duration step (right panel) given by Chinese students split by session (T1 vs. T2).
3.1.2. Comparing the Chinese students’ lexical stress perception at T2 to that of Spanish native speakers
The analysis of Model 2 (Table 2) revealed the following results. First, there were significant main effects of Pitch Step and Duration Step. This suggests that an increase in pitch or duration differences between /a/ and /o/ led to a higher proportion of SW responses in Spanish natives. The significant interactions of Pitch Step × Group (Figure 3, left panel) and Duration Step × Group (Figure 3, right panel) suggest that the Chinese students at T2 differed from the Spanish natives in terms of the pitch and duration cue-weighting for Spanish lexical stress categorization. Concretely, compared to Spanish natives, the Chinese students relied more on pitch (β = 2.13) but less on duration (β = −1.10) to perceive Spanish lexical stress contrast.
Table 2. Summary of Model 2.
Notes. Model formula: response ~ pitch step * duration step * group + (pitch step + duration step | participant). Group is a categorical variable with dummy coding; reference level = Spanish Natives.

Figure 3. The proportion of strong–weak (SW) responses as a function of pitch step (left panel) and duration step (right panel) given by Spanish natives and Chinese students at T2.
Interestingly, there was a significant Pitch Step × Duration Step interaction at the reference level (Spanish natives). This means that Spanish natives’ reliance on the pitch cue for Spanish lexical stress categorization was conditioned by the duration cue. The positive coefficient (0.16) means that the increase in SW responses associated with an increase in Pitch Step grew larger as Duration Step increased. In other words, the Spanish natives used both pitch and duration cues for lexical stress categorization. By contrast, the Chinese students did not show such a pattern: Duration Step did not significantly modulate the effect of Pitch Step on their responses at either T1 or T2, as shown by the non-significant Pitch Step × Duration Step and Pitch Step × Duration Step × Session interactions in Model 1 (see Table 1).
To summarize, Chinese students predominantly relied on pitch for L2 Spanish lexical stress categorization before and after SA. Spanish natives relied on both pitch and duration cues although duration was stronger. After SA, the Chinese students upweighted both pitch and duration cues for Spanish lexical stress categorization, with pitch remaining dominant.
3.2. The role of auditory acuity abilities and the amount of L2 input in L2 cue-weighting shift
Table 3 summarizes the descriptive statistics of all the scores related to this section. In the supplementary materials, we provide a descriptive figure of the percentage of SW responses given by the Chinese students to each of the 49 synthesized stimuli, split by pitch discrimination abilities and amount of L2 use.
Table 3. Descriptive statistics of pitch discrimination score (in Hertz), duration discrimination score (in milliseconds), L2 input score, pitch cue-weighting, and cue-weighting shift score.
Although we included Duration Discrimination Score as an independent variable, an initial analysis did not reveal any significant main effect of Duration Discrimination Score or any interaction with other factors. We therefore removed Duration Discrimination Score from Model 3, which increased the adjusted R² from 0.49 to 0.53. The final Model 3 included Pitch Discrimination Score, L2 Input Score, and their interaction.
Model 3 (Table 4) did not reveal significant main effects of Pitch Discrimination Score or L2 Input Score on Chinese students’ Cue-Weighting Shift Score. However, there was a significant two-way interaction of L2 Input Score × Pitch Discrimination Score. Recall that the higher the Pitch Discrimination Score, the less sensitive a listener is to pitch differences. The positive coefficient means that after SA, as L2 Input Score increases, those who were less accurate in perceiving pitch cues showed a higher Cue-Weighting Shift Score. One possible explanation is that participants with lower pitch acuity may not have sufficient access to pitch information. Therefore, an increased amount of L2 input may have drawn their attention to alternative cues such as duration and shifted their cue-weighting patterns towards native norms. As both independent variables are continuous, Figure 4 plots three regression lines of Cue-Weighting Shift Score as a function of L2 Input Score, estimated at the mean value of the Pitch Discrimination Score and at one standard deviation above (+1 SD) and below (−1 SD) the mean, respectively.
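The three regression lines described above are simple slopes: in a linear model with an interaction, the slope of L2 Input Score at a given standardized Pitch Discrimination Score is the L2 Input coefficient plus the interaction coefficient multiplied by that score. The sketch below illustrates this with hypothetical coefficients. Only the positive sign of the interaction is taken from the text; the magnitudes are placeholders and not the estimates in Table 4.

```python
def simple_slope(b_input: float, b_interaction: float, moderator_z: float) -> float:
    """Slope of L2 Input Score at a given z-scored Pitch Discrimination Score."""
    return b_input + b_interaction * moderator_z


B_INPUT = 0.0        # hypothetical placeholder for the L2 Input main effect
B_INTERACTION = 0.5  # hypothetical magnitude; only the positive sign is reported

# A higher Pitch Discrimination Score means lower pitch acuity.
for label, z in [("-1 SD (higher pitch acuity)", -1.0),
                 ("mean", 0.0),
                 ("+1 SD (lower pitch acuity)", 1.0)]:
    print(f"{label}: slope of L2 Input = {simple_slope(B_INPUT, B_INTERACTION, z):+.2f}")
```

With a positive interaction, the slope of L2 Input Score increases with the Pitch Discrimination Score, so the line for learners with lower pitch acuity is the steepest of the three.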
Table 4. Summary of Model 3.
Notes. Multiple R² = 0.58; Adjusted R² = 0.53. Independent variables are centered at zero.

Figure 4. Chinese students’ cue-weighting shift after study abroad (SA) in perceiving L2 Spanish lexical stress predicted by the pitch discrimination ability and the amount of L2 input during SA.
4. Discussion and conclusions
This longitudinal study investigated how individual differences in domain-specific auditory acuity and the amount of L2 input affect the cue-weighting shift in L2 perceptual categorization after a Study Abroad (SA) program. Overall, the L1-Mandarin–L2-Spanish learners showed a more categorical perception pattern of Spanish lexical stress after SA. The Chinese students upweighted both pitch and duration for Spanish lexical stress perception, with pitch always being the primary perceptual cue. There were considerable individual differences among the participants. Specifically, learners with lower pitch acuity showed more cue-weighting shift towards the Spanish norms given more Spanish input during SA. In what follows, we will organize our discussion by addressing the hypotheses.
Our results confirmed Hypothesis 1 that after SA, the L2 learners would show clearer lexical stress categories than before SA. This is supported by the significantly larger coefficients of both Pitch Step and Duration Step at T2 than at T1. The enhanced effects of pitch and duration can be viewed as an improvement in suprasegmental categorization after SA. However, pitch remained the primary perceptual cue. Specifically, while learners significantly upweighted duration at T2, they upweighted pitch to a greater extent (see Model 1). Consequently, despite the improvement in duration use, the group-level results at T2 showed a larger divergence from the native norms compared to T1. Previous studies revealed similar findings among L1 Mandarin listeners with long-term immersion in the target L2 society, showing a tendency to place more weight on pitch in suprasegmental categorization (Li and Xi, 2024b; Petrova et al., 2023; Wang et al., 2024). These results suggest that a cue-weighting shift is not easy to observe at the group level, even in an L2 immersion context. However, due to considerable individual differences among learners, the picture may be different when examining the SA effects at the individual level.
More importantly, the Chinese students’ duration-cue reliance increased after SA, which suggests a potential cue-weighting shift at the group level. This finding conforms to a previous study where long-term residence in the target L2 society led to more duration-cue reliance for Mandarin listeners’ L2 suprasegmental categorization, compared to short-term residence (Petrova et al., 2023). In practice, Spanish lexical stress can be marked by multiple cues, including pitch, duration, and intensity (Hualde, 2012; Hualde and Prieto, 2015; Ortega-Llebaria and Prieto, 2011). Learners are free to choose whichever cue(s) they find useful to identify lexical stress in real-life communication. If attending to pitch already suffices to identify stress most of the time, learners are likely to reinforce this perceptual strategy. As a result, the learners did not show a trade-off between pitch and duration weightings. Instead, SA led to a more categorical perceptual pattern with both cues, which allowed learners to integrate new strategies without abandoning the L1-transferred acoustic cues. However, at the sentence level, pitch is not always a reliable cue for Spanish lexical stress, whereas duration plays a more robust role (Torreira et al., 2014). Therefore, increased exposure to the L2 during SA may lead learners to start shifting their perceptual cue-weighting, although with considerable individual variation.
Our data support Hypothesis 2 that, at the individual level, auditory acuity interacts with the amount of L2 input on the cue-weighting shift. Specifically, lower pitch acuity may signal a more flexible cue-weighting shift, which leads participants to reduce pitch-cue reliance relative to duration to perceive lexical stress after SA given sufficient L2 input. By contrast, more L2 input could lead listeners with higher pitch acuity to rely more on pitch cues. In general, tone language speakers are sensitive to pitch changes and thus show stable pitch-cue reliance in the perception of speech and non-speech stimuli (Bidelman et al., 2013; Chandrasekaran et al., 2009; Krishnan et al., 2009). However, our data suggest that even within tone language speakers, individual differences in auditory acuity still result in different L2 learning outcomes. Participants with lower pitch acuity scores can recalibrate pitch-cue reliance with an increased amount of L2 input. This is likely because they paid less attention to pitch information, which reduced the significance of pitch as a critical acoustic cue in their processing of suprasegmental features. More importantly, our findings can possibly explain why previous studies revealed limited SA effects on L2 perceptual cue-weighting shift (Cebrian, 2006; Ingvalson et al., 2011), which might have been hindered by the individual variability across participants.
Duration discrimination ability did not significantly predict suprasegmental cue-weighting shift in L2 perceptual categorization. This finding aligns with the distinction between auditory acuity tested with non-speech stimuli and perceptual performance in L2 speech. For instance, L1 Japanese listeners can discriminate F3 differences in non-speech stimuli yet fail to perceive the English /r/–/l/ distinction using the critical F3 cue (Miyawaki et al., 1975). Perhaps auditory acuity reflects lower-order processing abilities (Xu et al., 2024). It may therefore be unable to fully override the L1-to-L2 cue-weighting transfer in lexical stress perception, which happens at a higher-order cognitive level. However, more empirical evidence is needed to test this hypothesis. Moreover, a specific domain of auditory acuity may affect L2 learners’ cue-weighting in perceptual learning when the acoustic cue is relevant in their L1. In our case, duration is a secondary acoustic cue for Mandarin word-level prosody. It may not be easy to detect a significant predictive role of duration acuity in perceiving L2 word-level prosody. Liu (2022) showed that when perceiving the L2 English /d–t/ contrast, Mandarin listeners’ duration acuity was related to their duration-cue reliance (VOT), but their pitch acuity and pitch-cue reliance (initial F0) did not correlate. This mirrors our results, as duration (VOT) is more relevant than pitch for Mandarin stop contrasts, whereas pitch is more relevant than duration for Mandarin word-level prosody. In line with Liu’s (2022) study, another longitudinal study by our team showed that duration acuity predicted Chinese students’ change in the categorical perception of L2 Spanish stops using VOT as the sole acoustic cue (Xi et al., 2026). The role of auditory acuity in L2 cue-weighting seems to be conditional on the relevance of the specific acoustic cue in the learners’ L1.
Finally, our sample size and length of immersion might not allow sufficient power to find significant results on duration acuity. Future studies might want to continue exploring this hypothesis with a larger group of learners over a longer period of observation.
The current study leads to several theoretical contributions. First, we provide longitudinal evidence showing that learners’ L1 cue-weighting largely determines their L2 cue-weighting (Francis and Nusbaum, 2002; Francis et al., 2000; Kondaurova and Francis, 2008; Zhang and Francis, 2010), with a less-studied language pair, Mandarin-Spanish. In our case, pitch is consistently the primary cue for L1 Mandarin listeners’ perception of L2 lexical stress at the group level. Second, our results expand the cue-weighting theory by showing that the cue-weighting shift in L2 perception is driven by two key factors: individual auditory acuity and the amount of L2 input. In other words, although adults show consistent cue-weighting strategies for speech perception (Jasmin et al., 2020), there is still potential to recalibrate the cue-weighting patterns. This implies that, provided with proper training methods, L2 learners can gradually shift their perceptual strategies of L2 suprasegmental categorization towards the native norms.
The current study has certain limitations that future studies might want to address. First, our Chinese students were all graduates and moved to Spain for a master’s program taught in Spanish rather than to learn the Spanish language per se. It was therefore not possible to have a parallel ‘at-home’ control group in China. A future study might want to focus on learners of the Spanish language to advance this topic. Second, as cue-weighting shift typically needs a long time, a longitudinal study with a longer time frame would be welcome for better understanding the development of L2 perceptual learning. In addition, we relied on self-reports to measure L2 input. Although we carefully controlled for estimation bias by using an L2/L1 ratio, future studies may want to use more ecological methods such as language diaries or objective tracking to quantify input more precisely. It would be useful to record changes in L2 use during SA rather than surveying it after SA, in order to dynamically model how it links to changes in a dependent variable such as cue-weighting shift (for comprehensive methodological recommendations, see Nagle and Zárate-Sández, 2024). Finally, as noted by an anonymous reviewer, learners’ cue-weighting shift may be a proxy of L2 proficiency change after SA. In the present study, however, learners reported their Spanish proficiency levels based on different official tests. Therefore, the proficiency data were not a reliable individual variable for statistical analysis. A replication study using a uniform, dedicated Spanish proficiency assessment would be welcome to examine potential relationships between L2 proficiency gains and cue-weighting shifts.
To conclude, the current study provided longitudinal evidence in a less-investigated language pair – L1-Mandarin–L2-Spanish – to demonstrate that individual differences in auditory acuity and the amount of L2 input jointly affect adult learners’ cue-weighting strategies in L2 suprasegmental perception. Therefore, adult learners still show flexibility in L2 speech learning given proper conditions. Our findings highlight the need for taking individual approaches when investigating the development of L2 learning, especially in naturalistic settings.
Supplemental Material
Supplemental material, sj-docx-1-slr-10.1177_02676583261441156 for The interaction of auditory acuity and L2 input in suprasegmental cue-weighting shifts after study abroad: Mandarin speakers’ perception of Spanish lexical stress by Peng Li, Xiaotong Xi and Clara D. Martin in Second Language Research
Acknowledgements
The authors would like to thank Sara Coego for her support in creating the experimental stimuli.
Ethical approval
This study was approved by the Norwegian Agency for Shared Services in Education and Research (SIKT) under the project title Spanish pronunciation by Chinese learners (reference number 732455).
Consent for publication
All participants provided written consent, allowing the researchers to collect and analyse their personal data for research purposes.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is supported by the Basque Government through the BERC 2022–25 program and is funded by the Spanish State Research Agency through BCBL Severo Ochoa excellence accreditation (CEX2020-001010/AEI/10.13039/501100011033), the Spanish Ministry of Science and Innovation through the Juan de la Cierva program funded by ‘NextGenerationEU’/PRTR (JDC2022-048729-I to P.L.), the Spanish Ministry of Economy and Competitiveness (PID2023-148756NB-I00 to C.D.M.), and the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement number 819093 to C.D.M.).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.