
Abstract

This study examines the interrelationships among the language learning context, cross-language perceptual mapping, and individual differences in the acquisition of perceptual categorization in a study abroad program over one academic year. Thirty Mandarin speakers learning Spanish completed an identification task upon arrival and at the end of their program. The task employed Spanish /p-b, t-d, k-ɡ/ contrasts on three continua differing in voice onset time. Individual differences measures included auditory acuity of duration and Spanish language use during the study abroad period. Their perceptual performance on the Spanish voicing contrast was compared to that of a Mandarin–English control group and a Spanish Native Group. At the group level, the learners’ perceptual performance showed little change after studying abroad and fell between that of the Mandarin–English controls and the Spanish natives. However, at the individual level, greater auditory acuity of duration and more Spanish use predicted improved perceptual performance from the beginning to the end of the program. The findings suggest that while the formation of second language perceptual categories can be challenging, individual differences can modulate development over time.

Keywords

Voice onset time, perceptual categorization, auditory acuity, L2 use, study abroad

1 Introduction

Learning context can strongly affect second language (L2) learners’ development of new sound categories and directly determine the amount and type of input learners experience by affording different opportunities for interaction in the L2. In the case of perceptual categorization, the target of the present study, study abroad (SA) provides learners with repeated, sustained exposure to the L2. Therefore, it can facilitate the development of phonetic categories through a combination of implicit, distribution-based learning and explicit, communication-driven learning whereby learners adjust their L2 categories for communicative ends (Nagle & Zárate-Sández, 2024; Netelenbos & Li, 2013). More importantly, even in SA contexts where learners have abundant, rich exposure to the L2 over a concentrated period, individual differences can affect the way learners interact with the input at various levels. In this longitudinal study, we examine how auditory acuity and L2 use independently modulate L1 Mandarin listeners’ L2 Spanish perceptual categories within an L2 SA context. This language pair has received less attention in the L2 speech perception literature.

The Mandarin speakers in our study began learning Spanish at the start of their undergraduate degree and had previously learned English as a mandatory school subject. In terms of order of acquisition, Spanish is their third language (L3). In the present study, we adopt the classic definition of L2 acquisition as the learning of any additional language after the first language (L1), regardless of its numeric order (Saville-Troike & Barto, 2017). Several considerations motivate this choice. Methodologically, L3 transfer research typically targets the initial stages of acquisition (Rothman, 2015), whereas our participants were not ab initio Spanish learners, who may not be ideal subjects for L3 studies (Cabrelli Amaro, 2012). Theoretically, our study examines Spanish stop voicing, for which prevoicing is the primary acoustic cue, which is absent in Mandarin and not frequent in English (Hunnicutt & Morris, 2016; Lee & Zee, 2003). Empirically, prior evidence shows that English does not significantly influence L1 Mandarin listeners’ perception of Spanish voicing contrasts (Liu et al., 2019). Given these factors, our study aligns more naturally with L2 research, and we therefore refer to Spanish as the learners’ L2 for theoretical clarity and consistency.

1.1 Theoretical models of L2 speech acquisition

Scholars have proposed several models to account for L2 speech acquisition, including the Speech Learning Model (SLM; Flege, 1995), its revised version (SLM-r; Flege & Bohn, 2021), and the L2 Perceptual Assimilation Model (PAM-L2; Best & Tyler, 2007). SLM and SLM-r propose that successful L2 sound learning depends on accurately perceiving the L2 phonetic category. Specifically, establishing a new L2 category depends upon the way it assimilates into existing native categories. Importantly, when the L2 sound and L1 sound are similar but not identical, learners may fail to form a new category, resulting in assimilation of L1 and L2 categories. According to the SLM and SLM-r, adult learners retain the capacity to form novel L2 categories that coexist with L1 categories in the same phonetic space. At the same time, individual differences, such as auditory acuity and L2 input quality and quantity, are hypothesized to strongly influence this process (Flege & Bohn, 2021).

PAM-L2 focuses on how learners assimilate L2 sounds to existing L1 categories, as L2 perceptual categorization is shaped by the L1 phonological system (Best & Tyler, 2007). At the initial stage, learners map an L2 category to the closest L1 category depending on their perceptual similarity. If two L2 sounds correspond to two distinct L1 categories (two categories, TC), discrimination between the two L2 sounds is easy. If two L2 sounds are mapped to a single L1 category (single category, SC), discrimination of the L2 contrast is difficult. Between these lies “category goodness” (CG) assimilation, where both L2 sounds map to one L1 category but differ in perceived fit, leading to moderate difficulty in distinguishing the L2 sounds.

In the context of the current study, the SC pattern is of particular relevance for Mandarin-speaking learners acquiring Spanish stops. Spanish /b, d, ɡ/ are realized as voiced stops with negative Voice Onset Time (VOT) following a pause or following nasal consonants (/d/ is also a stop after /l/); they are lenited to approximants [β, ð, ɣ] elsewhere. Spanish /p, t, k/ are voiceless stops with short-lag VOT (Martínez-Celdrán et al., 2003). In contrast, for languages such as Mandarin, there exists only an aspiration contrast (i.e., aspirated vs. unaspirated). Mandarin does not have voiced stops on the phonological level and does not show evidence for negative VOT on the phonetic level. Mandarin unaspirated /p, t, k/ are realized with short-lag VOT, while aspirated /pʰ, tʰ, kʰ/ show a long-lag VOT (Chao & Chen, 2008; Lee & Zee, 2003). Consequently, L1 Mandarin speakers typically assimilate voiced and voiceless stop categories into Mandarin unaspirated /p, t, k/ (Feng & Busà, 2022; Li & Ye, 2025; Liu et al., 2019; Liu & Lin, 2021; Xi & Li, 2022, 2023). Thus, Mandarin speakers face a twofold challenge when learning the L2 voicing contrast in Spanish. First, they must establish a new phonetic category for voiced stops. Second, they must learn to realize this new category with negative VOT. Crucially, while PAM-L2 accounts for early learning, it also predicts that accumulated L2 use can drive the formation of these new phonetic categories, allowing experienced learners to outperform novices in L2 perception.

To summarize, perceptual similarity between L1 and L2 categories predicts difficulties in perceiving new sound categories within both SLM/SLM-r and PAM-L2. More importantly, both theoretical models adopt a dynamic view of L2 perception, predicting improvement with L2 use. The SLM-r also emphasizes a strong role for individual differences in the acquisition of L2 sound categories, whereby factors such as auditory acuity may influence development. In the current study, we take a longitudinal approach to investigating the perceptual acquisition of L2 sound categories in a SA context while considering two important individual factors: auditory acuity and L2 use.

1.2 The acquisition of L2 Spanish stops

While theoretical frameworks provide predictions for speech learning, empirical research documenting the longitudinal development of L2 Spanish stops in both classroom and immersion contexts is unevenly distributed across L1 backgrounds. To date, this line of research has primarily focused on learners whose L1 is English. In English, “voiced” /b, d, ɡ/ can be phonetically realized with short-lag VOT, or occasionally, with negative VOT (Flege, 1982; Hunnicutt & Morris, 2016; Ladefoged, 1998; Lisker & Abramson, 1964; Roach, 2004), whereas the “voiceless” /p, t, k/ are aspirated in word-initial position and in stressed syllables, except after /s/ (Ladefoged, 1998; Roach, 2004). Therefore, English speakers in general show difficulties in perceiving L2 Spanish stop contrasts, which has been investigated in some longitudinal studies.

Zampini (1998) trained L1 English learners’ perception and production of the Spanish /p-b/ contrast over one semester. The learners had advanced Spanish proficiency and were asked to identify Spanish non-words “pada” and “bada” with the initial stop /p-b/ varying in the length of VOT from −40 ms to 56 ms. The results showed that with phonetic training, learners shifted their perceptual boundary location of /p-b/ contrast toward the Spanish norms, although the shift was not sustained at the end of the semester.

Nagle (2018) conducted a one-year longitudinal study examining the perceptual development of the Spanish /p-b/ contrast in L1 English listeners enrolled in a second-semester college Spanish course. Participants completed a picture-matching task with auditory input that consisted of words embedded in sentences. The results showed continuous improvement in identification accuracy over one academic year, but accuracy rates remained significantly lower than those of L1 Spanish listener controls at the end of the study.

In a quasi-longitudinal study, Casillas (2020) examined L1 English absolute beginners’ perception and production of Spanish stops over a 7-week domestic language immersion program. Participants completed a perceptual categorization task using stimuli drawn from a synthesized 13-step VOT continuum, ranging from −60 ms to 60 ms, anchored by the lexical items bata “robe” or pata “paw” at either extreme. By the third week, the L2 learners had started to shift their boundary location of the Spanish /p-b/ contrast toward the target norm and continued doing so (except in the sixth week) until the end of the program. Finally, the learners showed performance similar to Spanish–English bilinguals, who served as a benchmark for successful, high-proficiency bilingualism.

These longitudinal results show that L1 English-speaking learners’ perception of L2 Spanish stops can fall within the boundaries of native Spanish listener categories, and the cross-language mapping challenges occur at the phonetic level. Therefore, L1 English learners may initially assimilate the Spanish /b, d, ɡ/ and /p, t, k/ to English /p, t, k/, but with increased learning experience and proper training, they can adjust the phonetic details of the “voiced” category.

In contrast, as previously noted, L1 Mandarin-speaking learners’ challenges of learning L2 voicing contrasts lie at both the phonological and phonetic levels. Their boundary location of L2 voicing contrasts is further along the voicing continuum than L1 Spanish speakers (Liu et al., 2019; Liu & Lin, 2021). For instance, L1 Mandarin learners with advanced level Spanish proficiency who had lived in Spain for 1.2–1.8 years showed a significantly different perceptual boundary location (23.04 ms) from Spanish monolinguals (−2.27 ms) for the Spanish /p-b/ contrast (Liu et al., 2019). More recently, Bravo Díaz (2025) systematically assessed the perception and production of Spanish stops by Mandarin-speaking learners across different proficiency levels. Notably, while this contrast proved perceptually difficult, performance improved with increased overall proficiency. Nevertheless, the study found that even advanced learners did not reach target-like¹ performance.

To summarize, L1 Mandarin listeners need to form a new phonetic category for accurate perception of the Spanish voiced-voiceless contrast. Previous work has contributed to our understanding of how Mandarin listeners perceive Spanish stops, but unlike work with English-Spanish learners, we lack longitudinal data that clarifies how the process unfolds over time. The present longitudinal study will provide insight into the development of new sound categories in a lesser-studied multilingual population and furthermore allow us to examine the effect of individual differences during SA.

1.3 Auditory acuity and the amount of L2 use in L2 speech acquisition: the present study

In addition to challenges related to cross-language perceptual mapping, individual differences also affect L2 learning outcomes (Suzukida, 2021). These individual factors generally fall into three categories: learner experience (e.g., foreign language education, daily L2 use, and length of formal learning or immersion); cognitive abilities (e.g., executive function and aptitude); and age (e.g., chronological age and age of acquisition) (Saito, 2023). In the present study, participants showed little variability in the age factor (see Section 2.1). However, we did consider variability in terms of Spanish use and individual aptitude. In SA contexts, the same length of residence abroad does not necessarily entail comparable L2 use (Flege & Bohn, 2021), as some learners actively seek contact with the L2 community while others do not (Borràs & Llanes, 2021). Thus, accurately documenting the amount of L2 use is crucial, especially given inconsistent findings regarding its role. For instance, Casillas (2020) found that a 7-week domestic immersion program successfully shifted American English listeners’ perceptual boundary of Spanish /p-b/ toward the target norm, independent of language use. In contrast, Turner (2025) showed that the amount of French input positively correlated with English speakers’ production accuracy of French /u/ after 6 months of residence abroad. In the present study, we surveyed participants’ Spanish use with a detailed questionnaire to capture their L2 use during SA and its impact on perceptual development.

Beyond experiential and age factors, aptitude also strongly predicts L2 learning outcomes and interacts with variables such as attitude, motivation, personality, general intelligence, and musical ability (Abrahamsson & Hyltenstam, 2008; Dewaele & MacIntyre, 2019; Li et al., 2026; Li, Ioannidou, et al., 2024; Li, Zhang, et al., 2024; Nardo & Reiterer, 2009; Nowbakht & Fazilatfar, 2019; Ożańska-Ponikwia & Dewaele, 2012). Recent research highlights a neglected aspect of aptitude, auditory acuity (i.e., the sensitivity to acoustic dimensions such as duration), which can cascade into broader perceptual learning effects (Saito, 2023). Learners with finer auditory acuity make greater perceptual gains when immersed in the L2 (Sun et al., 2021), and auditory acuity is a moderate to strong predictor of L2 speech learning for individuals with 1 to 10 years of study or residence abroad experience (Kachlicka et al., 2019; Saito, Sun, et al., 2022), but less predictive for those with short-term SA (under 6 months) (Saito, Sun, et al., 2022) or no immersion at all (Saito et al., 2021).

Turning to fine-grained phonetic features, auditory acuity positively associates with the acquisition of segmental and prosodic accuracy in L2 learning (Saito, Sun, et al., 2022). In perceiving fine-grained phonetic detail, auditory acuity of formant correlates with the accurate perception of difficult L2 contrasts such as English /i-ɪ/ and /l-r/ among L1 Japanese listeners (Saito, Cui, et al., 2022; Saito, Kachlicka, et al., 2022; Saito, Sun, et al., 2022). With respect to L2 stop perception, Liu (2022) reported that L1 Mandarin listeners with better auditory acuity of duration relied more on VOT to perceive the English /t-d/ contrast than those with lower ability. However, although initial fundamental frequency (F0) is also a relevant cue for English /t-d/, auditory acuity of pitch did not affect the listeners’ perception. Together, these findings suggest a link between auditory acuity and L2 speech learning. There is as yet no research on the role of auditory acuity in the longitudinal development of new sound categories in a SA setting. The present study addresses this gap by focusing on the role of auditory acuity of duration in the development of Spanish VOT perception over time. Since VOT is a duration-dependent measure, we hypothesize that participants with better auditory acuity of duration will show greater improvement in Spanish stop category development.

In short, we included two individual difference factors as predictors for the current study: individual auditory acuity and the amount of L2 input during a SA program. From a pedagogical perspective, understanding the roles of these individual differences is critical for maximizing the efficacy of SA programs. If learning is primarily driven by the amount of L2 use, then SA programs should prioritize resources that maximize social interaction and engagement. However, if aptitude factors play a decisive role, increasing language use alone may be insufficient for certain learners. Identifying these constraints is the first step toward optimizing L2 speech development in naturalistic settings. Accordingly, in the present study, we formulate the following research questions:

RQ1: How does the L1 Mandarin listeners’ perceptual categorization of Spanish stop voicing contrasts change after SA?

Hypothesis 1 (H1): Following the SA experience, L1 Mandarin listeners will improve their perception of Spanish stop voicing contrasts, as reflected by category boundaries shifting toward those of native Spanish listeners and steeper perceptual slopes indicating increased categorical perceptual patterns.

RQ2: To what extent are auditory acuity of duration and language use associated with changes in the perceptual categorization of Spanish stop voicing contrasts during SA?

Hypothesis 2 (H2): Higher individual auditory acuity and greater Spanish use during SA will be positively associated with changes in the perceptual categorization of Spanish stop voicing contrasts, resulting in patterns that more closely approximate those of native Spanish listeners.

To answer RQ1, we used an identification task with three synthesized VOT continua for /p-b, t-d, k-ɡ/ to test the participants’ perceptual performance before and after SA. To answer RQ2, we used a duration discrimination test and a questionnaire to elicit individual difference data.

2 Method

2.1 Participants

A total of 73 participants were recruited for the current study, forming three groups: the Experimental Group, the Control Group, and the Spanish Native Group. All signed written consent to participate. None reported any speech or hearing impairments. Our original intent was to balance gender across all groups, but only one male participant responded to our call for participants. Therefore, we decided to recruit only female participants for all three groups.

The Experimental Group consisted of 30 female Mandarin Chinese-speaking learners of Spanish (M_age = 22.23 years, SD = 0.68). The participants had completed their undergraduate degree in Spanish language studies and were from various Chinese universities. The mean age at which participants began to learn Spanish was 18.07 years (SD = 0.58), at the start of their undergraduate program. When they arrived in Spain, they had received instruction in the Spanish language for an average of 4.17 years (SD = 0.46) in China. None of the students had studied abroad prior to their arrival in Spain. Their Spanish proficiency levels upon arrival in Spain ranged from B1 to C1 as measured by one of two standard Spanish proficiency tests: DELE (Diploma de Español como Lengua Extranjera) and SIELE (Servicio Internacional de Evaluación de la Lengua Española), according to the Common European Framework of Reference for Languages (Council of Europe, 2001). In addition, all participants had passed the College English Test-Level 6 (intermediate-to-advanced level proficiency), administered to Chinese college students. See Appendix A for the detailed demographic information.

During their stay in Spain, participants were enrolled in a Master’s Degree program with Spanish as the language of instruction. The Master’s program included courses in Discourse Analysis, Historical Studies, Latin American Studies, Journalism, and Literature Studies. The Experimental Group participants were recruited from Barcelona, which exposed them to Catalan during their SA. As Catalan shows very similar laryngeal configurations to those of Spanish (Carbonell & Llisterri, 1992), we consider that the limited exposure to Catalan would not likely affect the results. Two participants reported having been enrolled in advanced Spanish language training courses during SA, and both confirmed that the course focused on improving reading, writing, and conversational skills in general, with no phonetic training dedicated to specific phonemes. The longitudinal observation lasted on average 248.33 days (SD = 5.35), starting from October 2022 (T1 spanning 11 days) to June 2023 (T2 spanning 14 days).

The Control Group consisted of 24 female Mandarin-speaking learners of English (M_age = 20.29 years, SD = 1.12), who had never been exposed to Spanish and had never studied or resided abroad. They were English-language majors at a public university in China. The Spanish Native Group was comprised of 19 female speakers of Castilian Spanish (M_age = 21.11 years, SD = 3.05) to provide reference data for Spanish VOT perception. All were from Spain and reported growing up in monolingual Castilian-speaking regions. Although they had passive exposure to Catalan in daily life because they were living in Barcelona, none reported fluent knowledge of Catalan.

2.2 Materials and procedure

This study was part of a larger project examining L1 Mandarin learners’ phonological development of Spanish during an SA program. The Experimental Group completed a larger battery of tests, including production as well as perception and individual difference tasks. Here, we report on a subsection that includes the perception task, the duration discrimination task, and the questionnaires. The Spanish natives and the Control Group were tested only once. The ethical approval of this study was issued by the Norwegian Agency for Shared Services in Education and Research (SIKT) and the Academic Committee of the first author’s institution.

2.2.1 Stop perception task

To test the perception of Spanish stop consonants, we prepared VOT continua of bilabial, dental, and velar stops. A female Spanish speaker was recorded in a soundproof room via Shure SM35 headphones connected to the Zoom H4n Pro recorder with a sampling rate/resolution of 44.1 kHz/16-bit. The speaker produced three word pairs, beso-peso “kiss-weight,” día-tía “day-aunt,” and goma-coma “rubber-comma,” each with the target stop in word-initial position. Following Winn (2020), each pair of words was uploaded to Praat (Boersma & Weenink, 2017) to synthesize stimuli, resulting in a VOT continuum ranging from −60 ms to 60 ms in 10 ms increments. F0 and intensity were kept constant across all the stimuli. In this way, we obtained 13 tokens for each continuum.

The identification task was presented using Praat. Each of the 13 tokens was repeated four times in a randomized order. In total, there were 156 trials (13 tokens × 3 continua × 4 repetitions). In each trial, on-screen instructions (e.g., “peso → Z” and “beso → M”) prompted participants to press either “Z” or “M” on the keyboard to indicate whether the stimulus contained a voiced or voiceless stop. The assignment of response keys was counterbalanced across participants: for instance, half saw “peso → Z” and “beso → M,” while the other half saw “beso → Z” and “peso → M.” The target words were presented in Spanish. Spanish orthography is almost fully transparent, which allowed even the Control Group to infer letter-to-sound mappings based on their knowledge of the Latin alphabet. No explicit explanation of the Spanish words’ meanings was provided. Each trial was presented to the participants with a 500 ms interval. After every 52 trials, participants were allowed to take a break and could resume the task by pressing the space bar. All three groups completed the stop perception task.
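The trial structure described above can be sketched in Python as follows. This is a minimal illustration rather than the Praat script used in the study: the function name `build_trials`, the fixed seed, and the decision to shuffle all three continua together are assumptions, since the text does not specify how randomization was blocked.

```python
import random

# Values taken from the text: 13 VOT steps per continuum (-60 to 60 ms in
# 10 ms increments), three continua, and four repetitions per token.
VOT_STEPS_MS = list(range(-60, 61, 10))
CONTINUA = ["beso-peso", "día-tía", "goma-coma"]
REPETITIONS = 4


def build_trials(seed=None):
    """Return a shuffled list of (continuum, vot_ms) trials."""
    trials = [
        (continuum, vot)
        for continuum in CONTINUA
        for vot in VOT_STEPS_MS
        for _ in range(REPETITIONS)
    ]
    # Assumption: every trial is shuffled into one pool across continua.
    random.Random(seed).shuffle(trials)
    return trials


trials = build_trials(seed=1)
print(len(VOT_STEPS_MS), len(trials))  # 13 156
```

The printed counts match the 13 tokens per continuum and the 156 trials reported in the text.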

2.2.2 Duration discrimination task

The duration discrimination task was taken from the L2 auditory processing test batteries (Mora-Plaza et al., 2022). Participants completed an AXB task that consisted of a series of three 350 Hz tones, of which either A or B matched X in duration. The stimulus duration ranged from 250 ms (Level 0) to 500 ms (Level 100), in increments of 2.5 ms. The task comprised a continuum of 100 synthesized stimuli, with the task beginning at stimulus level 50, and adjusted for difficulty based on participants’ responses. The numbers “1” and “3,” displayed on the screen, prompted participants to identify the stimulus that was different. After three correct responses, the difficulty level of the subsequent trial increased, while one incorrect response resulted in a decrease in difficulty. The system automatically generated participants’ discrimination scores on a 100-point scale. Only the Experimental Group completed the duration discrimination task at T1.
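The adaptive logic of this task can be illustrated with a short Python sketch. It is not the code of the published test battery. The step size (`step=5`), the requirement that the three correct responses be consecutive, and the reading that lower levels (durations closer to 250 ms) are the harder ones are all assumptions made for illustration; only the level-to-duration mapping and the three-correct/one-incorrect rule come from the text.

```python
def level_to_duration_ms(level):
    """Map a stimulus level (0-100) to its duration: 250 ms plus 2.5 ms per level."""
    if not 0 <= level <= 100:
        raise ValueError("level must be between 0 and 100")
    return 250.0 + 2.5 * level


def next_level(level, streak, correct, step=5):
    """Return (new_level, new_streak) after one trial.

    Assumption: lower levels are harder because the stimulus is closer to 250 ms.
    `step` is a hypothetical step size; the source does not report it.
    """
    if not correct:
        # One incorrect response makes the next trial easier.
        return min(100, level + step), 0
    streak += 1
    if streak == 3:
        # Three correct responses (assumed consecutive) make the next trial harder.
        return max(0, level - step), 0
    return level, streak


# The task starts at level 50, i.e., a 375 ms stimulus.
print(level_to_duration_ms(50))
print(next_level(50, streak=2, correct=True))
print(next_level(50, streak=0, correct=False))
```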

2.2.3 Questionnaires

Adapted from the Language History Questionnaire (Li et al., 2020) and the L2 English Experience Questionnaire (Sun et al., 2024), we designed two questionnaires. The Linguistic Background Questionnaire was administered at pre-SA and included questions related to the participants’ demographic information and language history, such as current age, gender, onset of L2 learning, length of L2 learning, and language proficiency (Appendix B). The Language Use Questionnaire was administered at post-SA and focused on the participants’ language usage during SA, such as hours spent engaging in various listening and speaking activities in both the L1 and the L2 (Appendix C). Only the Experimental Group completed the questionnaires.

2.3 Data coding

Participants’ responses in the stop perception task were recorded using Praat and coded into a binary variable response (voiced response = 0, and voiceless response = 1). In total, there were 15,600 responses. Table 1 summarizes the number of participants and responses. Three participants in the Experimental Group did not complete the post-SA session, but their data were not excluded from the analyses because mixed-effects models (see Subsection 2.4) can effectively handle a small proportion of missing data.

Table 1.

Summary of Data Coded for the Stop Perception Task.

Group	N of participants	Test sessions	N of responses
Experimental	30ᵃ	pre-SA & post-SA	8,892
Control	24	single session	3,744
Spanish natives	19	single session	2,964

ᵃ Three participants in the Experimental Group did not return for post-SA.
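The response counts in Table 1 follow directly from the design, as the short Python check below shows. It assumes that every participant who attended a session completed all 156 trials, which the text implies but does not state; the variable names are illustrative.

```python
TRIALS_PER_SESSION = 13 * 3 * 4  # tokens x continua x repetitions = 156

# Participants per session; three Experimental participants missed post-SA.
sessions = {
    "Experimental": [30, 27],  # pre-SA, post-SA
    "Control": [24],
    "Spanish natives": [19],
}

# Assumption: each attending participant contributed a full set of trials.
responses = {
    group: sum(n * TRIALS_PER_SESSION for n in counts)
    for group, counts in sessions.items()
}
total = sum(responses.values())

print(responses)  # {'Experimental': 8892, 'Control': 3744, 'Spanish natives': 2964}
print(total)      # 15600
```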

For the duration discrimination task, the scores were automatically generated by the testing system and then multiplied by 2.5 to obtain the duration discrimination threshold in milliseconds (Saito et al., 2020). For example, if a participant scored 30, the minimum duration difference they could perceive between two sounds was 75 ms. The Experimental Group showed an average duration discrimination threshold of 49.83 ms (SD = 40, range = 12.1 ms–215.83 ms). Because a higher score indicates lower discrimination ability, we reversed the automatically generated score using Formula (1) for statistical purposes, so that higher values represent greater ability.

$$\text{Duration discrimination score} = 100 - \text{automatically generated score} \tag{1}$$
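The two score transformations can be written as a pair of small Python helpers. The function names are hypothetical; the arithmetic follows the text (raw score × 2.5 for the threshold in ms, and 100 minus the raw score for the reversed score).

```python
def threshold_ms(raw_score):
    """Duration discrimination threshold in ms: raw score multiplied by 2.5."""
    return raw_score * 2.5


def reversed_score(raw_score):
    """Reversed score (Formula 1), so that higher values mean greater ability."""
    return 100 - raw_score


# Worked example from the text: a raw score of 30.
print(threshold_ms(30))    # 75.0
print(reversed_score(30))  # 70
```

A participant with a raw score of 30 thus has a 75 ms threshold and enters the models with a duration discrimination score of 70.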

We calculated an L2 use score for each participant in the Experimental Group to represent the amount of Spanish input relative to that of Mandarin during SA. First, we added up the number of hours per day that each participant reported using Spanish and Mandarin, respectively, for various listening activities. Second, to account for individual differences in self-estimation, we calculated the L2 use score by dividing the total number of hours spent listening to Spanish by the total number of hours spent listening to Mandarin. This resulted in one L2 use score per participant, which allowed comparison across individuals. The Experimental Group’s mean L2 use score was 1.20 (SD = 0.98, range = 0.31–3.67).
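The ratio can be sketched in Python as follows. The activity labels and the hours are invented for illustration and are not taken from the questionnaire in Appendix C; the function name `l2_use_score` is likewise hypothetical.

```python
def l2_use_score(spanish_hours, mandarin_hours):
    """Ratio of daily Spanish listening hours to daily Mandarin listening hours."""
    total_spanish = sum(spanish_hours.values())
    total_mandarin = sum(mandarin_hours.values())
    if total_mandarin == 0:
        raise ValueError("Mandarin listening hours must be greater than zero")
    return total_spanish / total_mandarin


# Hypothetical participant (hours per day by listening activity).
spanish = {"lectures": 2.0, "conversation": 1.0, "media": 1.0}
mandarin = {"conversation": 1.5, "media": 0.5}

print(l2_use_score(spanish, mandarin))  # 2.0
```

A score above 1 indicates more reported listening in Spanish than in Mandarin, and a score below 1 indicates the reverse.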

2.4 Statistical analyses

To answer each research question, we evaluated the participants’ performance on two measures: the perceptual slope and the boundary location. The perceptual slope measures how distinctly the pair of sounds is categorized by the listeners. Therefore, an increased slope marks more distinct categorization, which is said to have a “crisper” categorical boundary (Morrison, 2007). We applied a Generalized Linear Mixed Model (GLMM) to the binary response variable, where the coefficient of the VOT step represents the change in log-odds of selecting “voiceless” per unit increase in VOT step. A larger slope estimate corresponds to a steeper S-shaped curve, which means listeners switch from voiced to voiceless response more abruptly, indicating a clearer categorical boundary and more decisive judgments about the L2 stop contrast.
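To make the link between slope and "crispness" concrete, the following Python sketch compares two hypothetical slope values. The coefficients 0.10 and 0.20 are invented for illustration and are not estimates from this study. The helper `transition_width_ms` is also hypothetical; it returns the VOT distance between the 25% and 75% points of the logistic curve, which equals 2·ln(3) divided by the slope.

```python
import math


def p_voiceless(vot_ms, intercept, slope):
    """Logistic probability of a "voiceless" response at a given VOT."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * vot_ms)))


def transition_width_ms(slope):
    """VOT distance between the 25% and 75% points: 2 * ln(3) / slope."""
    return 2.0 * math.log(3.0) / slope


# Hypothetical slopes (log-odds per ms of VOT), with the boundary fixed at 0 ms.
for slope in (0.10, 0.20):
    width = transition_width_ms(slope)
    print(f"slope={slope:.2f}  width={width:.1f} ms  "
          f"p(+10 ms)={p_voiceless(10, 0.0, slope):.2f}")
```

Doubling the slope halves the width of the transition region, so responses move from mostly "voiced" to mostly "voiceless" over a narrower range of VOT values.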

The boundary location refers to the 50% crossover point at which participants shift their responses from voiced to voiceless. Following Casillas (2020), we calculated the boundary location in R using GLMM with the package lme4 (Bates et al., 2015). For each group and place of articulation (POA) (i.e., bilabial /p-b/, dental /t-d/, and velar /k-ɡ/), we conducted a GLMM with the binary response as the dependent variable and VOT step (−60 ms to 60 ms) as a fixed effect. The random effects structure included a random intercept for each subject with a random slope for the VOT step. Coefficients for each subject were extracted from the GLMM. The boundary location was calculated by dividing each subject’s intercept (β₀) by the estimated slope for the effect of VOT step (β_VOT) and then multiplying the result by −1, see Formula (2).

$$\text{boundary location} = -1 \times \frac{\beta_{0}}{\beta_{\mathrm{VOT}}} \tag{2}$$
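Formula (2) can be expressed as a small Python function. This sketch covers only the final arithmetic step; it does not reproduce the lme4 model fitting or the extraction of per-subject coefficients. The coefficients in the example are hypothetical.

```python
def boundary_location(intercept, slope):
    """50% crossover point in ms: -1 * (intercept / slope), as in Formula (2)."""
    if slope == 0:
        raise ValueError("slope must be non-zero to define a boundary")
    return -1.0 * intercept / slope


# Hypothetical listener: intercept of -2.0 log-odds, slope of 0.1 log-odds per ms.
b0, b_vot = -2.0, 0.1
boundary = boundary_location(b0, b_vot)
print(f"{boundary:.1f} ms")

# At the boundary the log-odds are zero, so p(voiceless) = 0.5.
log_odds_at_boundary = b0 + b_vot * boundary
print(abs(log_odds_at_boundary) < 1e-9)
```

The second check confirms why the formula works: the boundary is the VOT value at which the predicted log-odds of a "voiceless" response equal zero.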

We fitted a series of GLMMs to the binary-coded responses and Linear Mixed Models (LMMs) to the boundary locations to answer the research questions. The significance of the fixed effects included in the GLMMs and LMMs was determined using the lmerTest package (Kuznetsova et al., 2017). For all models, we initially specified a maximal random-effects structure including random slopes for all within-unit factors by participant and by item. This structure was simplified when models failed to converge or resulted in a singular fit. The final structure for all models is presented in the Results section.

3 Results

In this section, we report the results of the statistical analysis, with each subsection addressing one RQ. To answer RQ1, which investigated the development of the Experimental Group’s perceptual categorization of Spanish stops during the SA program, we ran three sets of comparisons: (a) Control Group versus Experimental Group at pre-SA, (b) Experimental Group at pre-SA versus Experimental Group at post-SA, and (c) Experimental Group at post-SA versus Spanish natives. To answer RQ2, which explored the influence of auditory acuity of duration and the amount of L2 use on the perceptual development of Spanish stops during SA, we analyzed only the Experimental Group’s pre-SA and post-SA data. The individual factors were the duration discrimination score and the L2 use score.

3.1 The development of L2 perceptual categorization

Table 2 summarizes the descriptive statistics of the boundary locations for each stop contrast, as well as the average boundary location pooled across places of articulation. The inferential statistics are reported in the following subsections.

Table 2.

Means (Standard Deviations) of Boundary Location (in ms) Across Consonant Pairs and Groups.

	/p-b/	/t-d/	/k-ɡ/	Average
Control Group	28.0 (3.2)	25.8 (4.0)	25.9 (4.6)	26.6 (4.1)
Experimental Group
Pre-SA	16.8 (13.9)	16.1 (11.2)	18.8 (12.0)	17.3 (12.3)
Post-SA	11.0 (32.6)	17.5 (9.0)	16.7 (10.2)	15.0 (20.4)
Spanish natives	−2.8 (3.6)	−1.7 (1.3)	4.7 (2.0)	0.1 (4.1)

3.1.1 The comparison between the Control Group and the Experimental Group at pre-SA

First, we compared the Experimental Group’s pre-SA perceptual slope with that of the Control Group. In the GLMMs, we included group (2 levels: Control Group = 0 and Experimental Group at pre-SA = 1), VOT step, and their interaction as fixed effects. Random effects included a by-participant random intercept with a random slope for VOT step. Figure 1 plots the predicted voiceless responses as modulated by VOT step.

Figure 1.

Predicted proportion of voiceless responses as a function of VOT step for the Spanish stop contrasts.

The results of the GLMM (Table 3) revealed a significant main effect of VOT step, indicating that participants’ categorization of a stop as voiced or voiceless was significantly influenced by VOT. Moreover, the significant main effect of group suggested that when VOT = 0 ms, the Experimental Group at pre-SA showed significantly higher probability of voiceless responses than the Control Group. Finally, the significant two-way interaction indicated that the perceptual slope of the Experimental Group at pre-SA was significantly shallower than that of the Control Group, suggesting that the Control Group had a sharper and more defined category boundary.

Table 3.

Summary of GLMM for the Spanish Stop Perception Task Performed by the Control Group and the Experimental Group at pre-SA.

	Fixed effects				Random effects
	Fixed effects				By participant
	Log-odds	SE	z	p	SD
(Intercept)	−3.64	0.37	−9.94	< .001	1.61
VOT step	0.14	0.01	10.50	< .001	0.06
Group [pre-SA]	1.71	0.48	3.58	< .001	-
VOT step × Group [pre-SA]	−0.03	0.02	−2.00	.045	-

Note. Model formula: glmer(response ~ VOT step * group + (1 + VOT step | participant), family = “binomial”). Intercept = Log-odds of voiceless response of the Control Group when VOT step = 0 ms.

Significant p values are in boldface.
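To make the dummy-coded interaction model easier to read, the sketch below converts the rounded fixed effects in Table 3 into a group-specific intercept and slope, a predicted probability at 0 ms, and an implied population-level boundary. This is an illustrative Python reading of the published coefficients, not the authors' R code; the function and constant names are assumptions, and random effects are ignored.

```python
import math

# Fixed-effect estimates copied from Table 3 (log-odds scale, rounded).
INTERCEPT = -3.64    # Control Group at VOT step = 0 ms
VOT = 0.14           # slope of VOT step for the Control Group
GROUP = 1.71         # intercept shift for the Experimental Group (pre-SA)
VOT_X_GROUP = -0.03  # slope shift for the Experimental Group (pre-SA)


def group_coefficients(group):
    """Return (intercept, slope) for group 0 (Control) or 1 (Experimental, pre-SA)."""
    return INTERCEPT + GROUP * group, VOT + VOT_X_GROUP * group


def p_voiceless(group, vot_ms):
    """Predicted probability of a voiceless response for a group at a VOT step."""
    b0, b1 = group_coefficients(group)
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * vot_ms)))


for label, g in (("Control", 0), ("Experimental pre-SA", 1)):
    b0, b1 = group_coefficients(g)
    # Probability at 0 ms and implied 50% crossover (-b0 / b1) in ms.
    print(label, round(p_voiceless(g, 0.0), 3), round(-b0 / b1, 1))
```

With these rounded values, the implied crossovers are roughly 26.0 ms for the Control Group and 17.5 ms for the Experimental Group at pre-SA. Both are close to, but not identical with, the per-subject averages in Table 2 (26.6 ms and 17.3 ms), as expected when random effects and rounding are set aside.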

Second, we analyzed the boundary location. In the LMM, fixed effects included group (two levels: Control Group = 0 and Experimental Group at pre-SA = 1) and random effects included a by-participant random intercept. The LMM results (Table 4) showed a significant main effect of group, suggesting that the Experimental Group at pre-SA had significantly lower boundary locations compared to the Control Group (see Table 2 for descriptive data).

Table 4.

Summary of LMM Analyzing Perceptual Boundary Location for the Control Group and the Experimental Group at pre-SA.

	Fixed effects				Random effects
	Fixed effects				By participant
	Estimate	SE	t	p	SD
(Intercept = Control)	26.56	1.44	18.45	< .001	5.32
Group [pre-SA]	−9.31	1.93	−4.82	< .001	-

Note. Model formula: lmer(boundary ~ group + (1 | participant)).

Significant p values are in boldface.
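Because group is dummy-coded (Control Group = 0, Experimental Group at pre-SA = 1), the fixed effects in Table 4 can be read directly as model-implied group means. A worked reading, using the rounded estimates and the illustrative symbol \hat{y} for the predicted boundary location:

```latex
\begin{aligned}
\hat{y}_{\text{Control}} &= 26.56\ \text{ms},\\
\hat{y}_{\text{pre-SA}} &= 26.56 + (-9.31) = 17.25\ \text{ms}.
\end{aligned}
```

These values agree with the descriptive averages in Table 2 (26.6 ms and 17.3 ms).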

Together, the Experimental Group at pre-SA showed less perceptual crispness (shallower slopes) and lower boundary locations than the Control Group. Specifically, the Experimental Group’s boundary averaged 17.3 ms, which was closer to the native Spanish average of 0.1 ms than the Control Group’s average of 26.6 ms was. This suggests that their performance was likely due to their years of Spanish learning rather than their knowledge of English, while the shallower slopes point to a transitional phase in which new Spanish categories are forming but are not yet as robust as established native categories.

3.1.2 The analysis of the Experimental Group: comparing pre-SA to post-SA

We first assessed whether the Experimental Group changed their perceptual slope from pre-SA to post-SA. The GLMM included VOT step, session (pre-SA = 0, post-SA = 1), and their interaction as fixed effects. The random structure involved a random intercept of participant and a by-participant random slope of session. Figure 2 plots the predicted voiceless responses as modulated by VOT step.

Figure 2.

Predicted proportion of voiceless responses as a function of VOT step for the Spanish stop contrasts.

The results of the GLMM (Table 5) revealed a significant main effect of VOT step, suggesting that an increase in VOT led to significantly higher probability of voiceless responses. However, there was no significant VOT step × Session interaction, indicating that the steepness of the perceptual slope did not show significant change from pre-SA to post-SA.

Table 5.

Summary of GLMM for the Spanish Stop Perception Task Performed by the Experimental Group at Pre-SA and Post-SA.

	Fixed effects				Random effects
	Fixed effects				By participant
	Log-odds	SE	z	p	SD
(Intercept)	−0.99	0.14	−7.03	< .001	0.72
VOT step	0.06	0.00	35.27	< .001	-
Session [post-SA]	0.09	0.16	0.58	.565	0.74
VOT step × Session [post-SA]	−0.00	0.00	−1.48	.139	-

Note. Model formula: glmer(response ~ VOT step * session + (1 + session | participant), family = “binomial”). Intercept = Log-odds of voiceless response of pre-SA when VOT step = 0 ms.

Significant p values are in boldface.

We then evaluated whether there was any change in boundary location from pre-SA to post-SA within the Experimental Group. The LMM was conducted with session as the fixed effect and participant as random intercept. The results (Table 6) revealed no significant main effect of session, indicating that learners’ boundary locations remained quite stable during the SA period (see Table 2 for descriptive data).

Table 6.

Summary of LMM Analyzing Perceptual Boundary Location for the Experimental Group at Pre-SA and Post-SA.

	Fixed effects				Random effects
	Fixed effects				By participant
	Estimate	SE	t	p	SD
(Intercept = pre-SA)	17.26	2.16	7.98	< .001	8.48
Session [post-SA]	−1.94	2.22	−0.87	.384	-

Note. Model formula: lmer(boundary ~ session + (1 | participant)).

Significant p values are in boldface.

Overall, we did not observe significant changes in perceptual slopes and boundary locations at the group level.

3.1.3 The comparison between the Experimental Group at post-SA and Spanish natives

To assess the differences between the L2 learners and Spanish natives after SA, we compared the learners’ post-SA performance with that of the Spanish natives. For the perceptual slope, we again ran a GLMM with the response as the dependent variable. The fixed effects included group (2 levels: Experimental Group at post-SA = 0 and Spanish natives = 1), VOT step, and their interaction. Figure 3 plots the predicted voiceless responses as modulated by VOT step, split by group.

Figure 3.

Predicted proportion of voiceless responses as a function of VOT step for the Spanish stop contrasts.

The results of the GLMM (Table 7) revealed a significant main effect of VOT step, which means that an increase in VOT resulted in a significantly higher probability of voiceless responses. Moreover, the significant main effect of group suggested that when VOT = 0 ms, the Experimental Group at post-SA showed a significantly lower probability of voiceless responses than the Spanish natives. Finally, the significant VOT step × Group interaction indicated that the Spanish natives’ perceptual slope was steeper than that of the Experimental Group at post-SA. This means that after the SA period, learners still showed less certainty in categorization than native speakers.

Table 7.

Summary of GLMM for the Spanish Stop Perception Task Performed by the Experimental Group at Post-SA and Spanish Natives.

	Fixed effects				Random effects
	Fixed effects				By participant
	Log-odds	SE	z	p	SD
(Intercept)	−2.02	0.34	−5.93	< .001	1.65
VOT step	0.12	0.01	7.79	< .001	0.07
Group [Spanish natives]	2.00	0.52	3.87	< .001	-
VOT step × Group [Spanish natives]	0.05	0.02	2.19	.028	-

Note. Model formula: glmer(response ~ VOT step * group + (1 + VOT step | participant), family = “binomial”). Intercept = Log-odds of voiceless response for the Experimental Group at post-SA when VOT step = 0 ms.

Significant p values are in boldface.
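As a rough cross-check, the boundary formula (−1 × β₀ / β_VOT) can be applied to the rounded fixed effects in Table 7. The calculation below is a population-level approximation that ignores the by-participant random effects, so it is not expected to match the per-subject averages in Table 2 exactly:

```latex
\begin{aligned}
\text{post-SA learners:}\quad & -1 \times \frac{-2.02}{0.12} \approx 16.8\ \text{ms},\\
\text{Spanish natives:}\quad & -1 \times \frac{-2.02 + 2.00}{0.12 + 0.05} = -1 \times \frac{-0.02}{0.17} \approx 0.1\ \text{ms}.
\end{aligned}
```

The implied native boundary near 0 ms and the learners' boundary in the mid-teens are in line with the descriptive values of 0.1 ms and 15.0 ms.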

The LMM was fitted to boundary locations with group as the fixed effect and participant as a random intercept. The results (Table 8) revealed a significant main effect of group, indicating that Spanish natives showed significantly lower boundary locations than the Experimental Group at post-SA (see Table 2 for descriptive data).

Table 8.

Summary of LMM Analyzing Perceptual Boundary Location for the Experimental Group at Post-SA and Spanish Natives.

	Fixed effects				Random effects
	Fixed effects				By participant
	Estimate	SE	t	p	SD
(Intercept = post-SA)	15.03	2.21	6.80	< .001	8.44
Group [Spanish natives]	−14.96	3.44	−4.35	< .001	-

Note. Model formula: lmer(boundary ~ group + (1 | participant)).

Significant p values are in boldface.

To summarize, at the group level, the Experimental Group showed significant differences from the native speakers in the Spanish stop categorization after SA. This holds for both perceptual slopes and boundaries.

3.2 The effects of auditory acuity and the amount of L2 input on L2 perceptual development

To answer RQ2, we analyzed the Experimental Group’s perceptual boundaries and slopes at pre-SA and post-SA. Our initial analysis plan included both the duration discrimination score and the L2 use score as independent variables in an omnibus model. However, model diagnostics revealed multicollinearity among the fixed effects: most predictors exhibited variance inflation factor (VIF) values exceeding the conventional threshold of 10 (range: 11.7–36.2), with the exception of L2 use (VIF = 5.2). These results raised concerns regarding the stability of the fixed-effect estimates. We therefore decided to conduct separate analyses, as reported in the following subsections. Furthermore, all continuous variables were standardized (z-scores) prior to the statistical analyses. To answer our research question, we focus on interpreting the fixed effects associated with the duration discrimination or L2 use scores, with particular attention to the highest-level interactions.
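For readers less familiar with these diagnostics, the sketch below shows z-score standardization and the VIF definition, VIF = 1 / (1 − R²), in the simplest case of two predictors, where R² equals the squared Pearson correlation. It is a minimal Python illustration with hypothetical scores and function names. It does not reproduce the authors' diagnostics, which were computed in R on the omnibus model with its interaction terms.

```python
import statistics


def z_scores(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def pearson_r(x, y):
    """Pearson correlation computed from standardized scores."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


def vif_two_predictors(x, y):
    """VIF = 1 / (1 - R^2); with only two predictors, R^2 equals r^2."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical scores for six learners (not data from the study).
duration_score = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74]
l2_use_score = [3.1, 4.0, 2.8, 4.6, 3.0, 4.2]
print(round(vif_two_predictors(duration_score, l2_use_score), 2))
```

A VIF of 1 means a predictor is uncorrelated with the others; values above 10, as reported for most terms in the omnibus model, indicate that the estimate for that term is poorly separated from the remaining predictors.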

3.2.1 The role of auditory acuity of duration

First, we analyzed the perceptual slope by fitting a GLMM to the binary variable of response. The fixed effects included VOT step, session (pre-SA = −0.5, post-SA = 0.5), duration discrimination score, and all possible interactions. The results of the GLMM (Table 9) revealed a significant VOT step × Session × Duration discrimination interaction. This means that the Experimental Group’s perceptual slope (i.e., response as a function of VOT step) was modulated differently by the auditory acuity of duration at pre-SA and post-SA (Figure 4). The positive log-odds suggest that at post-SA, the voiceless response probability as a function of VOT step was more strongly associated with the participants’ auditory acuity of duration than at pre-SA. In other words, those participants with greater auditory acuity had steeper slopes (more distinctive categorization) at the post-SA session.

Table 9.

Summary of GLMM Results of the Experimental Group’s Responses of the Spanish Stop Perception Task Predicted by VOT Step, Duration Discrimination Score, and Session.

	Fixed effects				Random effects
	Fixed effects				By participant
	Log-odds	SE	z	p	SD
(Intercept)	−1.94	0.31	−6.31	< .001	1.63
VOT step	3.98	0.45	8.83	< .001	2.38
Session	0.17	0.13	1.29	.197	0.48
Duration	0.32	0.31	1.02	.306	-
VOT step × Session	−0.20	0.11	−1.79	.074	-
VOT step × Duration	−0.33	0.45	−0.72	.471	-
Session × Duration	−1.30	0.16	−8.07	< .001	-
VOT step × Session × Duration	2.05	0.18	11.44	< .001	-

Note. Model formula: glmer (response ~ VOT step * session * duration + (1 + VOT step + session | participant), family = “binomial”). Intercept = log-odds of voiceless responses averaged across session when the continuous predictors are at their mean. Session is contrast-coded (pre-SA = -0.5 vs. post-SA = 0.5).

Significant p values are in boldface.

Figure 4.

Interaction plot showing the proportion of voiceless responses as a function of z-scored VOT step for the perception of Spanish stop contrasts at pre-SA and post-SA, moderated by z-scored duration discrimination score.
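The three-way interaction can be unpacked by computing the conditional slope of VOT step for each session and acuity level. The Python sketch below does this with the rounded fixed effects from Table 9. The constant and function names are illustrative assumptions, the choice of ±1 SD is arbitrary, and random effects are ignored.

```python
# Fixed-effect estimates copied from Table 9 (log-odds; predictors z-scored).
B_VOT = 3.98
B_VOT_SESSION = -0.20
B_VOT_DURATION = -0.33
B_VOT_SESSION_DURATION = 2.05

# Contrast codes for session as stated in the table note.
SESSION_CODES = {"pre-SA": -0.5, "post-SA": 0.5}


def vot_slope(session, duration_z):
    """Conditional slope of z-scored VOT step for a session and acuity level."""
    s = SESSION_CODES[session]
    return (B_VOT
            + B_VOT_SESSION * s
            + B_VOT_DURATION * duration_z
            + B_VOT_SESSION_DURATION * s * duration_z)


# Compare listeners one SD below and one SD above the mean acuity score.
for duration_z in (-1.0, 1.0):
    for session in ("pre-SA", "post-SA"):
        print(duration_z, session, round(vot_slope(session, duration_z), 3))
```

Under these rounded estimates, a listener one SD above the mean in acuity moves from a conditional slope of about 2.73 at pre-SA to about 4.58 at post-SA, whereas a listener one SD below the mean moves from about 5.44 to about 3.19.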

Second, we modeled the boundary location using LMM with session, duration discrimination score, and their interaction as fixed effects. Random structure included a by-participant random intercept. The results did not reveal any significant findings (all ps > .05, see Table 10). Therefore, individual auditory acuity of duration did not significantly predict the changes in the Experimental Group’s L2 Spanish stop boundary location from pre-SA to post-SA.

Table 10.

Summary of LMM Analyzing Perceptual Boundary Location for the Experimental Group, Predicted by Session and Duration Discrimination Score.

	Fixed effects				Random effects
	Fixed effects				By participant
	Estimate	SE	t	p	SD
(Intercept)	0.01	0.11	0.06	.955	0.50
Session	0.12	0.13	0.89	.373	-
Duration	−0.13	0.12	−1.17	.252	-
Session × Duration	−0.00	0.13	−0.03	.974	-

Note. Model formula: lmer (location ~ session * duration + (1 | participant)). Intercept = estimated standardized boundary location averaged across sessions when the continuous predictor is at its mean. Session is contrast-coded (pre-SA = -0.5 vs. post-SA = 0.5).

To summarize, duration discrimination scores accounted for the changes in the Experimental Group’s perceptual slopes of Spanish stop contrasts from pre-SA to post-SA. However, this effect was not observed in perceptual boundary location.

3.2.2 The role of L2 use

The fixed effects of the models in this section were identical to those described in Section 3.2.1, except that the duration discrimination score was replaced by the L2 use score. For the sake of simplicity, we focus on interpreting the VOT step × Session × L2 use score interaction to answer our research questions.

First, we report the results of the perceptual slope (Table 11). The significant three-way interaction indicated that the perceptual slope was modulated differently by L2 use score at pre-SA and post-SA (Figure 5). Specifically, in the Experimental Group, more L2 use was associated with more reliance on VOT cues to distinguish L2 Spanish voicing contrasts. In other words, greater L2 use was associated with better perception performance after SA.

Table 11.

Summary of GLMM Results of the Experimental Group’s Responses of Spanish Stop Perception Task Predicted by VOT Step, Session, and L2 Use Score.

	Fixed effects				Random effects
	Fixed effects				By participant
	Log-odds	SE	z	p	SD
(Intercept)	−0.91	0.13	−7.29	< .001	0.62
VOT step	2.33	0.05	46.84	< .001	-
Session	0.08	0.16	0.49	.625	0.77
L2 use	0.19	0.12	1.54	.124	-
VOT step × Session	−0.16	0.10	−1.59	.113	-
VOT step × L2 use	−0.22	0.04	−4.91	< .001	-
Session × L2 use	−0.14	0.16	−0.85	.397	-
VOT step × Session × L2 use	0.31	0.09	3.46	< .001	-

Note. Model formula: glmer (response ~ VOT step * session * L2 use + (1 + session | participant), family = “binomial”). Intercept = log-odds of voiceless responses averaged across session when the continuous predictors are at their mean. Session is contrast-coded (pre-SA = -0.5 vs. post-SA = 0.5).

Significant p values are in boldface.

Figure 5.
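Because the two session codes (−0.5 and 0.5) differ by exactly one unit, the model-implied change in the VOT slope from pre-SA to post-SA is the VOT step × Session coefficient plus the three-way coefficient multiplied by the z-scored L2 use score. A worked reading of the rounded estimates in Table 11, with u as an illustrative symbol for that score:

```latex
\begin{aligned}
\Delta_{\text{slope}}(u) &= \beta_{\text{VOT} \times \text{Session}} + \beta_{\text{VOT} \times \text{Session} \times \text{L2 use}} \cdot u = -0.16 + 0.31\,u,\\
\Delta_{\text{slope}}(+1) &= -0.16 + 0.31 = 0.15,\\
\Delta_{\text{slope}}(-1) &= -0.16 - 0.31 = -0.47.
\end{aligned}
```

On this reading, learners one SD above the mean in L2 use show a slightly steeper slope at post-SA, while those one SD below show a shallower one. These are approximations from rounded coefficients.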

Second, regarding the boundary location, the analyses did not reveal any significant effects (all ps > .05, see Table 12). The null results suggest that the L2 use score did not significantly predict shifts in the Experimental Group’s boundary location from the pre-SA to post-SA sessions.

Table 12.

Summary of LMM Analyzing Perceptual Boundary Location for the Experimental Group, Predicted by Session and L2 Use Score.

	Fixed effects				Random effects
	Fixed effects				By participant
	Estimate	SE	t	p	SD
(Intercept)	−0.02	0.13	−0.16	.873	0.54
Session	0.10	0.14	0.73	.470	-
L2 use	−0.02	0.13	−0.18	.859	-
Session × L2 use	0.04	0.14	0.32	.751	-

Note. Model formula: lmer (location ~ session * L2 use + (1 | participant)). Intercept = estimated standardized boundary location averaged across sessions when the continuous predictor is at its mean. Session is contrast-coded (pre-SA = -0.5 vs. post-SA = 0.5).

In sum, the individual difference in L2 use score showed a pattern similar to duration discrimination scores in its relationship with the Experimental Group’s development of perceptual categorization of L2 Spanish stops. That is, the amount of L2 use was associated with the perceptual slope rather than the boundary location of L2 phonetic categories.

4 Discussion

In this longitudinal study, we focused on the development of perceptual categorization of L2 Spanish stops by L1 Mandarin listeners in an SA program. We asked (a) how L1 Mandarin listeners’ perceptual categorization of Spanish stop voicing contrasts changes after SA and (b) how auditory acuity and L2 use are associated with changes in the perceptual categorization. In what follows, we will organize our discussion to address each of the research questions.

4.1 Mandarin speakers’ development of perceptual categorization of L2 Spanish voicing contrasts

At pre-SA, the Experimental Group significantly differed from the Control Group in perceptual slopes and boundary locations. The significantly lower boundary location suggests that after around 4 years of formal learning, they had formed a specific perceptual category for Spanish voiced stops. However, the significantly shallower perceptual slope indicates that they were progressing toward target-like perceptual categorization but had not yet fully reached target-like performance. By contrast, the Control Group could only use their existing Mandarin and English phonological knowledge to perceive the Spanish stop voicing contrasts, which resulted in a more decisive perceptual slope but with a more Mandarin- or English-like boundary location. These findings align with the predictions of the SLM-r (Flege & Bohn, 2021) and previous empirical studies (Liu et al., 2019), showing that with classroom input, L2 learners can form new phonetic categories distinct from their existing linguistic systems.

At post-SA, the Experimental Group showed limited improvement from pre-SA and still significantly differed from the Spanish natives in boundary location and perceptual slope. The significantly higher boundary location compared to Spanish natives suggests that the SA experience might not be sufficiently long for measurable group-level change to take place. Therefore, with more exposure, the Experimental Group participants might shift their boundary locations. In terms of perceptual slopes, the Experimental Group did not show target-like crispness as their perceptual slope was shallower than that of the Spanish natives.

Importantly, our results do not align with previous longitudinal studies examining L1 English–L2 Spanish listeners, who showed significant improvements in their L2 Spanish stop perception (Casillas, 2020; Zampini, 1998). Casillas (2020) showed that after 7 weeks of domestic immersion, L1 English listeners reached a target-like mean boundary location for the Spanish /p-b/ contrast (shifting from 0.57 ms to −0.5 ms), whereas our participants did not reach a target-like boundary location (/p-b/: shifting from 16.8 ms to 11.0 ms). One possible explanation lies at the phonological level and concerns the differences in laryngeal configurations between English and Mandarin. Since English is not a typical aspirating language and the voiced /b, d, ɡ/ can be phonetically realized with negative VOT, English listeners have occasional exposure to prevoicing in their L1, but Mandarin listeners do not (Chao & Chen, 2008; Duanmu, 2007; Flege, 1982; Hunnicutt & Morris, 2016; Lisker & Abramson, 1964). Consequently, the learning challenges are potentially greater for Mandarin listeners than for English listeners, as indicated by the large difference in initial boundary locations reported in other studies and found here (English listeners = 0.57 ms in Casillas, 2020 vs. Mandarin listeners = 16.8 ms for the /p-b/ contrast in the current study). After almost 1 year of SA in Spain, our participants lowered their average boundary location by 2.3 ms (from 17.3 ms to 15.0 ms), which is numerically larger than the improvement of Casillas’ (2020) participants. Although this change was not statistically significant, the resulting 15.0 ms boundary represents a distinct developmental state when compared to the 26.6 ms boundary of the novice Control Group. This suggests the formation of an intermediate category boundary, consistent with the SLM-r proposal that L2 category formation may differ from target norms due to the creation of composite L1–L2 categories (Flege & Bohn, 2021).

Overall, at the group level, the Experimental Group showed limited changes in perceiving Spanish stops from pre-SA to post-SA. Since Mandarin does not have voiced stops at either the phonological or phonetic level, L1 Mandarin learners face larger challenges than L1 English learners when learning L2 voiced stops. Therefore, previous positive findings obtained from L1 English learners of L2 Spanish may not generalize to learners from aspirating language backgrounds. These results underscore the need to broaden the scope of empirical studies beyond commonly studied L1–L2 pairings. Examining distinct phonological profiles, such as the L1 Mandarin–L2 Spanish context presented in this study, is essential for isolating specific L1 transfer effects, thereby providing a more comprehensive understanding of L2 speech development that accounts for the constraints and differences of diverse initial states.

4.2 Individual differences in the development of L2 perceptual categorization

The Experimental Group’s improvement in perceptual slopes from pre-SA to post-SA differed significantly as a function of individual differences in auditory acuity of duration. Higher duration perception abilities were associated with greater improvement from pre-SA to post-SA, although at pre-SA, more accurate duration perceivers showed shallower perceptual slopes than less accurate ones. Previous studies observed a positive correlation between auditory acuity of duration and the perception of L2 stops using VOT as the acoustic cue (Liu, 2022). Our findings extend these with longitudinal data showing that individual differences in auditory acuity of duration could account for the development of L2 stop categorization. Together with the positive effects of formant processing abilities on L2 vowel and liquid consonant perception (Saito, Cui, et al., 2022; Saito, Kachlicka, et al., 2022; Saito, Sun, et al., 2022), it seems that auditory acuity does play a role in L2 speech development at the fine-grained phonetic level.

The amount of L2 use was associated with the improvement of perceptual slopes from pre-SA to post-SA: learners who used the L2 more developed crisper perceptual categories, which is in line with previous studies on the production of L2 vowels (Turner, 2025). While previous longitudinal studies showed inconsistent results on the role of L2 use (Casillas, 2020; Turner, 2025), the null results (Casillas, 2020) may have been due to the length and context of immersion. Specifically, Casillas (2020) examined perceptual changes across a 7-week domestic immersion program, whereas Turner (2025), whose design was comparable to ours, examined changes across 6 months in an SA context, which may have given rise to the positive findings for L2 use. Taken together, it seems that a relatively long period of SA and frequent L2 use can benefit L2 speech acquisition, at least for some individuals.

Interestingly, auditory acuity of duration was a more robust predictor than the amount of L2 use for the changes in perceptual slopes of stop categorization. Because we standardized the continuous predictors, the coefficients for the two individual difference measures are on a comparable scale, although they come from separate statistical models and the comparison should be read with caution. The magnitude of the three-way interaction involving the duration discrimination score (log-odds = 2.05) was much larger than that of the interaction involving L2 use (log-odds = 0.31). While previous studies have largely focused on the input and output quantity or the quality of social interaction (see Borràs & Llanes, 2021, for a review), our findings suggest that aptitude factors (i.e., auditory acuity) may play a more decisive role than experiential factors (i.e., amount of L2 use) in L2 speech development. Therefore, SA programs should consider incorporating explicit training that helps learners with lower auditory sensitivity notice acoustic details they might otherwise overlook in rapid, everyday L2 input.

4.3 Theoretical remarks and future directions

Our current findings add evidence to the theoretical models of L2 speech acquisition. We show that the main challenge of learning novel phonetic categories lies in the similarity to the learners’ preexisting categories (PAM-L2, Best & Tyler, 2007) and that the likelihood that a new category will be formed is related to individual differences (Flege & Bohn, 2021; Saito, 2023). However, the learners did not show much meaningful change in the boundary locations, nor were the boundary locations affected by individual differences. As Mandarin does not have voiced consonants, the Spanish voiced-voiceless contrast is considered a case of single-category (SC) assimilation for Mandarin listeners according to PAM-L2 (Best & Tyler, 2007), which poses the largest challenge for learners. Notably, unlike Zampini’s (1998) learners, our participants did not receive any explicit phonetic training on Spanish stops. Rather, any change in perceptual boundaries reflected implicit exposure. This implicit learning context requires that learners induce phonetic knowledge from daily L2 use (Conway & Christiansen, 2006; Pacton & Perruchet, 2008). However, in SA learners’ daily conversation, semantic and syntactic contexts often serve to disambiguate voiced-voiceless contrasts, reducing the functional need for learners to rely on precise VOT cues. Since communication remains successful despite non-target-like perception, learners may feel little communicative pressure to attend to fine-grained acoustic features. This reliance on top-down processing can lead to a stabilization of their intermediate categories. Moreover, it seems that the SA program might not be sufficiently long for meaningful changes to take place at the group level, unless individual differences favor the learning or explicit training is provided.

Despite its theoretical relevance, the current study has several limitations. First, the Experimental Group included learners with varied Spanish proficiency levels (B1–C1), likely reflecting differences in prior Spanish learning experience before SA. Future research might control proficiency or examine how prior experience affects post-SA learning outcomes (Leonard & Shea, 2017). Second, participants’ hearing history was assessed through self-report at recruitment. Future studies could incorporate standardized hearing assessments to enhance methodological rigor. Third, our SA participants were master’s students whose programs were not specifically in Spanish language studies, making it very difficult to recruit a comparable at-home control group in China. Future work could focus on Spanish majors at the undergraduate level, where students taking Spanish language courses in China can serve as an “at-home” control group. Finally, while we included all three Spanish stop contrasts in the design, unlike most previous studies, which included only one stop pair (Casillas, 2020; Nagle, 2018; Zampini, 1998), we did not aim to compare L2 perceptual performance across POA. Therefore, we did not control the vowel following the target stops, which may also affect the perception of VOT. This made it difficult to assess POA effects in the current analysis. Future studies should consider examining the POA effect specifically, with a strictly controlled phonetic environment.

5 Conclusion

This study investigated L1 Mandarin learners’ development of L2 Spanish stop perception in an SA context while measuring how individual differences in auditory acuity of duration and L2 use affect perceptual categorization. At the group level, the Experimental Group showed limited improvement in boundary location and perceptual slope along the VOT continua. Their perceptual performance fell between that of the Mandarin–English Control Group and the Spanish Native Group. This suggests that the Experimental Group established learner-specific L2 phonetic categories. At the individual level, auditory acuity of duration and L2 use positively predicted the improvement in perceptual slope of the Experimental Group from pre-SA to post-SA, which suggests that both aptitude and experiential factors contribute to L2 perceptual learning.

In conclusion, this longitudinal study shows that the formation of L2 perceptual categories is challenging for Mandarin–Spanish learners whose L1 lacks the target contrast at both the phonetic and phonological levels. Our results highlight the importance of individual differences in new phonetic category formation over time. Therefore, our study contributes to the speech perception literature by considering a less-commonly studied language pairing, longitudinal development, and the effects of individual differences.

Appendix A

Table A1.

Demographic Information of the Participants in the Experimental Group.

No.	Age	YoL	AoA	Master’s subject during SA	DELE	SIELE^a
c1	22	4	18	Discourse Analysis	-	b1
c2	21	4	17	Discourse Analysis	b2	c1
c3	22	4	18	Discourse Analysis	b1	b2
c4	22	4	18	Discourse Analysis	-	b2
c5	22	4	18	Discourse Analysis	b1	b2
c6	22	4	18	Discourse Analysis	b1	b2
c7	22	4	18	Latin American Studies	b2	-
c8	23	4	19	Latin American Studies	b2	-
c9	22	4	18	Historical Studies	-	b1
c10	24	5	19	Historical Studies	-	b2
c11	22	4	18	Historical Studies	c1	-
c12	22	4	18	Discourse Analysis	b2	c1
c13	23	4	19	Discourse Analysis	-	b2
c14	22	4	18	Discourse Analysis	b1	b2
c15	22	4	18	Literature Studies	-	b1
c16	23	4	19	Journalism	b1	b1
c17	23	5	18	Latin American Studies	-	b2
c18	23	4	19	Latin American Studies	-	b2
c19	23	4	19	Latin American Studies	b2	-
c20	22	4	18	Journalism	b2	-
c21	22	4	18	Journalism	b1	b2
c22	23	5	18	Journalism	b1	b2
c23	21	4	17	Discourse Analysis	-	b2
c24	23	6	17	Journalism	-	b1
c25	22	4	18	Discourse Analysis	b1	b2
c26	22	4	18	Literature Studies	b2	b2
c27	22	4	18	Literature Studies	-	b2
c28	22	4	18	Literature Studies	b1	b2
c29	21	4	17	Latin American Studies	b2	c1
c30	22	4	18	Literature Studies	b2	-

Note. For SIELE, the lowest proficiency level among the four skills of each participant is reported. YoL = years of formal learning of Spanish; AoA = age of onset of Spanish learning; DELE = Diplomas de Español como Lengua Extranjera; SIELE = Servicio Internacional de Evaluación de la Lengua Española.

^a Instead of an overall proficiency assessment, SIELE reports the examinee’s Spanish proficiency separately for each of the four skills: listening, speaking, reading, and writing.

Appendix B

Appendix C

Author contributions

Xiaotong Xi: conceptualization (equal); methodology (equal); investigation (equal); formal analysis (equal); writing—original draft (equal); writing—review and editing (equal).

Christine Shea: conceptualization (supporting); methodology (supporting); writing—original draft (supporting); writing—review and editing (equal).

Peng Li: conceptualization (equal); methodology (equal); investigation (equal); formal analysis (equal); writing—original draft (equal); writing—review and editing (equal).

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is supported by the Basque Government through the BERC 2022-2025 program, by the Spanish State Research Agency through BCBL Severo Ochoa excellence accreditation [CEX2020-001010/AEI/10.13039/501100011033], by the Spanish Ministry of Science and Innovation and the Spanish State Research Agency [JDC2022-048729-I], and by the Portuguese Foundation for Science and Technology [https://doi.org/10.54499/2023.07570.CEECIND/CP2891/CT0027].

Ethical considerations

The Norwegian Agency for Shared Services in Education and Research (SIKT) and the Academic Committee of the School of Foreign Languages, Shandong University of Finance and Economics, where part of the data collection took place, approved this study.

Consent to participate

Written informed consent was obtained from all individuals included in this study.

Data availability statement

The raw data can be obtained through this view-only link: .

Notes


Perceptual Development of Second Language Sound Categories in a Study Abroad Program: L1 Mandarin Speakers in Spain