Abstract
This study investigates the perception of Mandarin lexical tones and stops to examine the degree of overlap between music and language. Eighteen musicians and 21 nonmusicians participated in a typical categorical perception task. Results showed that musicians and nonmusicians had a comparable degree of categorical perception of tones and stops. Compared to nonmusicians, musicians exhibited enhanced sensitivity to within-category lexical tone stimuli. However, this improved ability was not observed in the perception of stops. These findings imply that musical experience sharpens sensitivity to subtle low-level acoustic variations between within-category lexical tone stimuli without interfering with the high-level phonological representations of lexical tones, and that this facilitatory effect is selective and does not readily extend to stop consonants in the native language.
Introduction
Music and language share acoustic features and cognitive mechanisms underlying the processing of sounds (Patel, 2008, 2011, 2014). These common acoustic features likely lead to the transfer of musical experience to the processing of speech sounds (Besson et al., 2011a, 2011b; Chen et al., 2021; Strait & Kraus, 2011): musical experience facilitates speech processing, with musicians exhibiting enhanced language abilities compared with nonmusicians (Lee & Hung, 2008; C.-Y. Lee et al., 2014; Nan et al., 2018; Tang et al., 2016; Wong et al., 2007). There is also evidence that one’s experience with language can influence musical perception (Deutsch, 1997; Deutsch et al., 2004). Musicians who speak a tone language manifested a more precise and stable form of absolute pitch than those who speak a non-tone language, possibly because tone language speakers employ absolute pitch as a cue to differentiate the meaning of words (Deutsch et al., 2004). However, phonetic investigations of tone languages show that they establish phonological equivalence in a manner analogous to that observed in languages such as English. This phonological equivalence is defined in proportion to the speaker’s pitch span (Ladd, 2008). Tone therefore delineates relative pitch height in the context of other tones, the speaker’s range, and paralinguistic intonation, rather than absolute pitch (Carter-Enyi, 2016).
Approximately two-thirds of the world’s languages are classified as tone languages (Yip, 2002), encompassing a wide range of languages from Asia, Africa, and indigenous American communities, as well as a handful of languages from Europe and the South Pacific (Maddieson, 2013), and the number of tone language speakers continues to grow. Mandarin is a notable example, with a total of 1.119 billion speakers worldwide, comprising 920 million native speakers and 199 million speakers of Mandarin as an additional language, according to Lewis et al. (2022). In a tone language, pitch serves as a vital tool for conveying meaning, in addition to emphasizing certain aspects, highlighting contrasts, and expressing a range of emotions as it does in other languages. Given the significant role of pitch in music, musical training could potentially enhance the processing of tones, so that musicians outperform nonmusicians. Nevertheless, existing research primarily focuses on how musical experience impacts the perception of lexical tones among individuals who do not speak tone languages, suggesting an improved ability to perceive nonnative speech sounds. Consequently, it remains uncertain whether musical training enhances lexical tone processing in native speakers of tone languages, for whom these tones are vital in day-to-day communication. Furthermore, since duration is another characteristic shared by music and language, it is worth investigating whether musical training also enhances the perception of temporal features in language. However, there is a paucity of studies investigating whether the possible musical benefit for lexical tones extends beyond pitch-related features to temporal features, particularly among individuals who speak tone languages.
This study seeks to examine how musical experience affects the categorical perception of lexical tones and stop consonants in Mandarin-speaking musicians and nonmusicians to enhance our insights into how musical experience could potentially transfer across domains to influence speech perception.
Musical Experience and Speech Processing
Ample behavioral research provides converging evidence for the beneficial impact of musical experience on speech processing (Alexander et al., 2005). Research into lexical tone perception showed that listeners with musical experience are more likely to accurately discern lexical tones than those who lack musical training (C.-Y. Lee et al., 2014). Gottfried and Riester (2000) reported that English-speaking musicians identified Mandarin lexical tones much better than nonmusicians, and that identification accuracy correlated positively with their ability to detect the direction of sine-wave glides. Additionally, musicians displayed higher accuracy and significantly faster performance in both identifying and discriminating Mandarin lexical tones. The facilitative effect of musical experience was also found for more complex lexical tone systems: English-speaking musicians demonstrated superior performance compared to nonmusicians in discriminating Thai tones (Burnham et al., 2015). However, recent studies indicated that the beneficial transfer of music to language is specific. Musical advantages extend to certain Cantonese tones, but not all of them (Choi, 2020). All in all, the evidence concurs in revealing musical advantages, though selective, in lexical tone processing.
Given that music and speech share acoustic properties (e.g., pitch, duration, timbre), it is likely that musical experience could also enhance the processing of duration and quality in speech sounds. However, it remains inconclusive whether this facilitative effect is specific to lexical tones or extends to other segments. Delogu et al. (2010) examined how Italian-speaking musicians and nonmusicians discriminated Mandarin monosyllabic words. Their results demonstrated enhanced discrimination of lexical tones, whereas musical benefits did not extend to segments (vowels and consonants), indicating that the advantages of musical experience were tone-specific and did not generalize to other perceptual levels. In stark contrast, some studies showed that musical experience could enhance the processing of duration in speech sounds (Rammsayer & Altenmüller, 2006; Yee et al., 1994). Musicians have superior temporal skills and could detect subtler durational changes than nonmusicians (Yee et al., 1994). In addition, they were more accurate in processing temporal information than nonmusicians (Rammsayer & Altenmüller, 2006).
Moreover, the benefits of musical experience extend to other aspects. Musical experience exerts influences on phoneme processing. Musical abilities could effectively predict the phonological ability in adult English-speaking musicians, and modulated phonological processing and reading abilities in English-speaking preschoolers (Anvari et al., 2002; Slevc & Miyake, 2006).
Apart from behavioral studies, a growing number of neurobiological studies show a facilitative music-to-language transfer effect. French-speaking musicians with no knowledge of Mandarin tones exhibited higher accuracy in detecting intonational and segmental variations, evidenced by earlier increased N2/N3 components for pitch-frequency variations and heightened P3b components for tones and segments (Marie et al., 2011). Moreover, Wong et al. (2007) found that among participants who did not speak Mandarin or other tone languages, musicians had more robust representations of Mandarin tones in their brainstem auditory responses than nonmusicians. Musicians also exhibited earlier and larger brainstem auditory responses to both speech and music than nonmusicians (Musacchia et al., 2007). Furthermore, musicians showed enhanced preattentive and attentive processing of lexical tones and durational changes (Magne et al., 2006; Marques et al., 2007; Milovanov et al., 2009).
Categorical Perception as a Nuanced Paradigm
Categorical perception refers to the phenomenon in which listeners sort continuous acoustic signals along a continuum into distinct categories during speech perception (Liberman et al., 1957). Classical categorical perception is characterized by a sharp slope in the identification function, accompanied by a corresponding peak in the discrimination function located close to the category boundary (Abramson, 1978). In the identification task, participants hear a sequence of stimuli that vary gradually along a continuum and label each stimulus on the basis of their phonological knowledge. In the discrimination task, they draw on their psychophysical abilities and linguistic knowledge to judge whether two presented sounds are the same or different. The discrimination of stimuli across categories relies on listeners’ internal phonological representations of speech sounds, while the discrimination of stimuli within the same category mainly depends on their sensitivity to subtle lower-level acoustic variations, without tapping into their higher-level phonological knowledge (Wu et al., 2015).
Considerable research has been undertaken to explore categorical perception in relation to suprasegmental aspects such as lexical tone, as well as in segmental aspects encompassing consonants and vowels within populations of native and nonnative speakers (Fry et al., 1962; Hallé et al., 2004; Liberman et al., 1957; Peng et al., 2010; Shen & Froud, 2016). These results suggested a categorical nature in consonant perception, evidenced by the alignment of a distinct category boundary with a discrimination peak positioned close to the boundary, whereas the perception of vowels remains inconclusive. Furthermore, the perception of pitch change by native speakers of Cantonese is categorical if a stimulus includes at least one contour tone (Francis et al., 2003). The research by Liberman et al. (1957) delved into the perception of synthesized stop consonants, which displayed variations across the continuum of F2 transitions. Listeners identified these arrays of variations as /b/, /d/, and /g/. Apart from the categorical perception of stop place contrasts, listeners demonstrated a categorical pattern in perceiving stop manner contrasts (Eimas et al., 1971; Xi et al., 2009). In one study, Ma et al. (2021) explored the development of categorical perception of voice onset time (VOT) and pitch in children and adults who speak Mandarin. Both children and adults were presented arrays of speech stimuli varying along the tone and VOT continua. Results showed that children aged four began to perceive the tone and VOT continua categorically. Importantly, electroencephalography studies also offered converging evidence for the findings observed in behavioral research (Shen & Froud, 2019), indicating that it is much easier to discriminate between-category stimuli compared with within-category ones.
Musical Experience and Categorical Perception
Given the benefits of music-to-language transfer in processing tones and segments, an increasing number of studies (Chen et al., 2020, 2021) have used the categorical perception paradigm to examine how far these facilitative effects extend, with the aim of clarifying the perceptual and cognitive mechanisms that underlie the processing of language and music. For example, Wu et al. (2015) investigated categorical perception of tone continua and non-speech tone analogs in Mandarin-speaking musicians and nonmusicians. Compared to nonmusicians, musicians had enhanced sensitivity to within-category stimuli but comparable boundary positions, boundary slopes, and between-category discrimination scores, demonstrating that musical experience benefits the perception of low-level acoustic information without tapping into high-level phonological knowledge among speakers of tone languages. Musical advantages in categorical perception also extend to nonnative speakers, indicating musical benefits in processing tones within a second language context. For instance, Chen et al. (2021) found that native Cantonese-speaking musicians outperformed nonmusicians and had increased sensitivity to duration and intrinsic F0 variations. English-speaking musicians also showed enhanced categorical perception of tone continua (Chen et al., 2020). Moreover, musical experience benefits categorical perception in amateur musicians: Mandarin-speaking musicians had larger mismatch negativities (MMNs) elicited by within-category deviants than nonmusicians, indicating that the musical advantage is not limited to musical experts (J. Zhu et al., 2021). However, several studies have yielded contradictory results, indicating that brief musical training did not improve the categorical perception of the Tone 2-Tone 3 continuum. Mandarin-speaking musicians and nonmusicians exhibited comparable sensitivity to lexical tones and within-pair stimuli, implying the robustness of high-level phonological knowledge (Zhao & Kuhl, 2015b), although English-speaking musicians demonstrated an improved capacity to process between-category tone stimuli compared to nonmusicians (Zhao & Kuhl, 2015a).
The Present Study
While a growing number of studies have investigated how musical experience influences the processing of tones and musicians’ superior sensitivity to acoustic signals other than pitch, few have addressed whether higher-level phonological operations are influenced by heightened low-level acoustic processing and whether such a musical advantage extends to the categorical perception of temporal features in native tone language speakers. This research, therefore, examines how musical experience affects the categorical perception of both pitch-related and duration-related features in native Mandarin-speaking musicians and nonmusicians. Based on prior findings, we predict that Mandarin-speaking musicians will exhibit superior sensitivity to within-category stimuli along the lexical tone continuum and that this enhanced sensitivity could generalize to the VOT continuum. Furthermore, musicians are expected to exhibit boundary positions, boundary widths, between-category discrimination scores, and peakedness values similar to those of nonmusicians.
Methods and Materials
Participants
Thirty-nine native Mandarin speakers took part in the identification and discrimination tasks. They were divided into a musician group and a nonmusician group on the basis of their musical experience. The musician group comprised 18 participants (females = 13, Mage = 21, SD = 1.11), defined as those who had received over 5 years (M = 6.89, SD = 2.19) of continuous musical training on their primary instruments (Chang et al., 2016). They all played pitch-based instruments such as the piano; those who played percussion instruments such as drums, which are mostly rhythm-based, were not included in the current study. They practiced regularly and actively participated in performances in the past 5 years. The nonmusician group consisted of 21 participants (females = 13, Mage = 21, SD = 0.98), screened with the criterion that they had received no formal or informal music instruction in the past 5 years (Wong et al., 2007). All participants were recruited in Zhejiang, China, and reported no history of speech, language, or hearing disorders. Table 1 summarizes the musical backgrounds of the musicians. The experiments were performed with the written consent of all participants and received approval from the Ethics Committee of Taizhou University.
Table 1. Musical Backgrounds of Musicians.
Stimuli
Two speech continua, one of lexical tones and one of VOT, were used in the experiment. The monosyllable /i1/ (“clothes” in Mandarin) was recorded from an adult male Mandarin speaker and digitized at 44,100 Hz with 16-bit precision. Starting from this natural /i1/ template, which carries a high-level tone, a lexical tone continuum spanning from Tone 1 to Tone 2 was synthesized using the pitch synchronous overlap and add (PSOLA) technique (Moulines & Laroche, 1995) in Praat (Boersma & Weenink, 2017), with all other acoustic attributes preserved. Specifically, the /i1/ template was first normalized to 500 ms and 70 dB. Next, the pitch contour was manually adjusted in Praat to a steady 140 Hz. A Praat script was then used to generate eleven stimuli varying along the lexical tone continuum, with adjacent stimuli differing by 6 Hz. The two endpoints of the continuum were prototypical instances of /i1/ and /i2/ (“aunt”). Figure 1 depicts the configuration of these eleven stimuli along the continuum.

Figure 1. Schematic diagram of the Tone 1 to Tone 2 continuum.
The VOT continuum from /pa1/ to /pha1/ was constructed as follows. First, the same adult speaker produced the monosyllabic words /pha1/ (“lying on one’s stomach”) and /pa1/ (“eight” in Mandarin), which served as the natural templates for further manipulation. Then, the initial portion of /p/ in /pa1/ was replaced with the aspirated portion of /pha1/ in increments of 9 ms. This yielded an 11-step VOT continuum ranging from 0 to 90 ms, following the progressive cutback and replacement technique described by Winn (2020). The vowel in each synthesized stimulus was derived from /pa1/ and kept at 300 ms, and the other acoustic attributes remained constant. The endpoints of this VOT continuum were prototypical instances of /pa1/ and /pha1/. The configuration of the eleven stimuli along the VOT continuum is displayed in Figure 2.
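As a rough illustration of the step sizes reported for the two continua, the sketch below lists the eleven values along each one. The names are hypothetical, and the tone part rests on an assumption: the text gives the 140 Hz level contour and the 6 Hz step but does not state which part of the contour changes, so the sketch assumes that the onset F0 falls from 140 Hz to 80 Hz while the offset stays at 140 Hz. It is not the authors' Praat script.

```python
# Step grids for the two 11-step continua (illustrative only).
N_STEPS = 11

# Tone continuum: adjacent stimuli differ by 6 Hz; the level endpoint is 140 Hz.
# ASSUMPTION: the onset F0 varies (140 Hz down to 80 Hz) and the offset is fixed.
TONE_STEP_HZ = 6
LEVEL_F0_HZ = 140
tone_onsets_hz = [LEVEL_F0_HZ - TONE_STEP_HZ * k for k in range(N_STEPS)]


def linear_f0(onset_hz, offset_hz=LEVEL_F0_HZ, duration_ms=500, n_points=6):
    """Hypothetical linear F0 contour as (time_ms, f0_hz) points."""
    times = [duration_ms * i / (n_points - 1) for i in range(n_points)]
    return [(t, onset_hz + (offset_hz - onset_hz) * t / duration_ms) for t in times]


# VOT continuum: 0 to 90 ms in 9 ms increments.
VOT_STEP_MS = 9
vot_ms = [VOT_STEP_MS * k for k in range(N_STEPS)]
```

Under these assumptions, both lists have eleven entries, and `linear_f0(tone_onsets_hz[0])` is the flat 140 Hz contour of the level-tone endpoint.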

Figure 2. The wideband spectrogram of the /pa1/ - /pha1/ continuum.
Procedures
All participants completed classical categorical perception tests for the lexical tone continuum and the VOT continuum. The order of the two continua was counterbalanced across participants: a subset of participants began with the lexical tone continuum, followed by the VOT continuum, while the remaining participants followed the reverse sequence. The whole process was self-paced, and participants were free to take breaks as required. The experiment was run using ExperimentMFC in Praat.
The identification task used a 2AFC paradigm. Before the task, participants completed a training session in which they familiarized themselves with the procedure by identifying the endpoints, labeled as 1 and 11. A practice block was then conducted prior to the formal test, including Stimuli 1, 2, 5, 6, 10, and 11 for both the lexical tone continuum and the VOT continuum, each played twice in random order. Participants were required to identify the endpoints during the practice block with an accuracy above 90%; those who failed to meet this criterion did not progress to the subsequent phase. In the formal testing phase, 10 instances of each of the 11 stimuli were presented randomly, resulting in 110 trials for each continuum. Participants were instructed to click the “第一声” button (Mandarin’s orthographic representation of Tone 1) or “ba” (Mandarin’s Pinyin) when they perceived a high-level tone or the monosyllable /pa1/. Similarly, they selected the “第二声” button (Mandarin’s orthographic representation of Tone 2) or “pa” (Mandarin’s Pinyin) for a high-rising tone or the monosyllable /pha1/.
For the discrimination task, an AX paradigm was employed. As in the identification task, there was a training session and a practice session before the formal test. In the training session, participants were presented with sound pairs composed of the endpoints (1-1, 11-11, 1-11, and 11-1) and were instructed to determine whether the two sounds were the same or different. In the practice session, there were two occurrences of 6 pairs (5-7, 7-5, 6-8, 8-6, 1-1, and 11-11) for the lexical tone continuum and another two occurrences of 6 pairs (2-4, 4-2, 3-5, 5-3, 1-1, and 11-11) for the VOT continuum. Each participant had to judge at least 90% of the endpoint pairs in the practice session correctly; otherwise, their data were excluded from further analysis. The formal test contained five occurrences of 29 comparison pairs for each of the tone and VOT continua. Of these, 18 were different pairs with a 2-step size in either a forward sequence (1-3, 2-4, 3-5, 4-6, 5-7, 6-8, 7-9, 8-10, and 9-11) or a reverse sequence (3-1, 4-2, 5-3, 6-4, 7-5, 8-6, 9-7, 10-8, and 11-9). The remaining 11 were identical pairs in which each of the 11 stimuli was paired with itself. All 145 trials (29 pairs × 5 occurrences) were presented in random order with an ISI of 500 ms (Chen & Peng, 2021). Participants were instructed to judge whether the two sounds they heard were identical, and they responded by selecting either the “一样” (“same” in Mandarin) or “不一样” (“different” in Mandarin) button.
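The composition of the formal discrimination test can be enumerated directly as a check on the pair and trial counts. The sketch below is illustrative only: the names and the fixed random seed are hypothetical, and the experiment itself was run in Praat's ExperimentMFC rather than in Python.

```python
import random

STEPS = range(1, 12)   # stimuli 1..11
REPETITIONS = 5        # occurrences of each pair in the formal test

forward = [(a, a + 2) for a in range(1, 10)]   # 1-3, 2-4, ..., 9-11
reverse = [(b, a) for a, b in forward]         # 3-1, 4-2, ..., 11-9
identical = [(s, s) for s in STEPS]            # 1-1, 2-2, ..., 11-11
pairs = forward + reverse + identical          # 9 + 9 + 11 = 29 pairs


def build_trials(seed=0):
    """Return all pairs repeated REPETITIONS times in shuffled order."""
    trials = pairs * REPETITIONS
    random.Random(seed).shuffle(trials)
    return trials
```

With 29 pairs and five repetitions, `len(build_trials())` is 145 for each continuum.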
Data Analysis
Five parameters were extracted to scrutinize the influence of musical training. These parameters encompassed the boundary width, boundary position, within- and between-category discrimination scores as well as peakedness.
First, concerning the identification task, two parameters were computed. The first parameter was the boundary position, denoting the point of 50% crossover within the identification curves. The second parameter, the boundary width, was characterized as the linear span between the 25th and 75th percentiles. These metrics were calculated using Probit analysis, as described by Finney (1971).
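The two identification parameters can be illustrated with a simplified probit fit. The sketch below is not Finney's iterative weighted procedure: it assumes an unweighted least-squares line through probit-transformed response proportions, with proportions clipped away from 0 and 1 so the transform stays finite. All function names and the clipping value `eps` are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the probit transform


def fit_probit(steps, proportions, eps=0.01):
    """Fit probit(p) = slope * step + intercept by ordinary least squares."""
    # Clip proportions so inv_cdf never receives exactly 0 or 1.
    z = [STD_NORMAL.inv_cdf(min(max(p, eps), 1 - eps)) for p in proportions]
    n = len(steps)
    mean_x = sum(steps) / n
    mean_z = sum(z) / n
    sxx = sum((x - mean_x) ** 2 for x in steps)
    sxz = sum((x - mean_x) * (zi - mean_z) for x, zi in zip(steps, z))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    return slope, intercept


def boundary_position(slope, intercept):
    """Step value at which the fitted curve crosses 50%."""
    return -intercept / slope


def boundary_width(slope):
    """Distance in steps between the 25% and 75% points of the fitted curve."""
    return abs(2 * STD_NORMAL.inv_cdf(0.75) / slope)
```

For a participant's proportion of Tone 2 (or /pha1/) responses at steps 1 to 11, `boundary_position` gives the 50% crossover and `boundary_width` the 25%-75% span, so a steeper identification curve yields a narrower width.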
Second, for the discrimination task, the sound pairs were grouped into nine comparison units (1-3, 2-4, 3-5, 4-6, 5-7, 6-8, 7-9, 8-10, and 9-11). Each unit comprised four types of pairs, namely the identical pairs (AA, BB) and the different pairs (AB, BA). Adjacent comparison units therefore shared AA or BB pairs. The discrimination score (P) for each comparison unit was computed using the equation given in Xu et al. (2006):
P = P(“S”|S) × P(S) + P(“D”|D) × P(D)
The term P(“S”|S) signifies the percentage of “same” responses (“S”) corresponding to all instances of “same” pairs (S). Similarly, P(“D”|D) denotes the percentage of “different” responses (“D”) concerning all occurrences of “different” pairs (D). Meanwhile, P(S) and P(D) represent the probabilities of “same” and “different” trials in each comparison unit, respectively—both amounting to 50% within this study. Furthermore, for each participant, the between-category sensitivity (Pbc) and within-category sensitivity (Pwc) were calculated based on the boundary position. Pbc indicates the mean score of discrimination pairs spanning the categorical boundary, while Pwc is defined as the average score of discrimination pairs within the same category. Additionally, the peakedness (Ppk), denoting the distinction between between-category sensitivity (Pbc) and within-category sensitivity (Pwc) (Jiang et al., 2012), was also computed.
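The quantities defined above can be sketched in a few lines. The code assumes that the discrimination score is the weighted sum of the two correct-response rates implied by the term definitions, and that a comparison unit counts as between-category when the boundary position lies strictly between its two stimulus numbers; the paper does not spell out that second rule, and the function names are hypothetical.

```python
from statistics import mean


def discrimination_score(p_same_given_same, p_diff_given_diff,
                         p_same=0.5, p_diff=0.5):
    """P = P("S"|S) * P(S) + P("D"|D) * P(D) for one comparison unit."""
    return p_same_given_same * p_same + p_diff_given_diff * p_diff


def summarize(unit_scores, boundary):
    """Return (Pbc, Pwc, Ppk) from {(low, high): P} and a boundary position.

    ASSUMPTION: a unit is between-category if low < boundary < high.
    """
    between, within = [], []
    for (low, high), score in unit_scores.items():
        (between if low < boundary < high else within).append(score)
    pbc = mean(between)
    pwc = mean(within)
    return pbc, pwc, pbc - pwc
```

For example, with a boundary at 5.5, units 4-6 and 5-7 would count as between-category and the remaining seven units as within-category.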
Results
Categorical Perception of Lexical Tones
Identification and Discrimination Curves
Figure 3 depicts the identification and discrimination curves of the lexical tone continuum for musicians and nonmusicians. Both groups perceived the lexical tone continuum in a categorical fashion, as indicated by the sharp identification slope and the correspondence between the discrimination peak and the boundary position.

Figure 3. The identification and discrimination curves of the tonal continuum for musicians and nonmusicians. The dotted line represents Tone 1 responses. The dashed line indicates Tone 2 responses. The solid line refers to discrimination curves. MM refers to Mandarin-speaking musicians. MNM refers to Mandarin-speaking nonmusicians.
Identification
Figure 4 displays the means and standard errors of the boundary width and boundary position. The boundary position of musicians is larger than that of nonmusicians, whereas the two groups have similar boundary widths. An independent samples t-test shows a significant group difference in boundary position (t[37] = 3.688, p < .001). This implies that musicians have significantly larger boundary positions, located nearer the level-tone end. The mean boundary widths shown in Figure 4 are 1.235 for musicians and 1.471 for nonmusicians. An independent samples t-test reveals no significant difference in boundary width between groups (t[37] = −1.212, p = .233).
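For readers who want to see where the 37 degrees of freedom come from, the sketch below computes an independent samples t statistic with pooled variance. This is an assumption about the test variant: the paper does not say whether equal variances were assumed, although a df of 37 equals 18 + 21 − 2, which is consistent with the pooled form. The function name is hypothetical, and no p-value is computed because that would require a t-distribution routine outside the standard library.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Student's t statistic and degrees of freedom for two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df
```

Passing one list of 18 per-participant values and one list of 21 returns `df == 37`, matching the degrees of freedom reported throughout the results.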

Figure 4. The means and standard errors of the boundary position and boundary width of the tonal continuum for musicians and nonmusicians.
Discrimination
Figure 5 shows the means and standard errors of the within- and between-category discrimination scores, as well as peakedness, for musicians and nonmusicians. Independent samples t-tests reveal no significant group differences in between-category discrimination scores (t[37] = 1.548, p = .130) or peakedness (t[37] = 0.252, p = .802). This indicates a comparable degree of categorical perception of lexical tones in the two groups. However, there is a statistically significant group difference in within-category discrimination scores (t[37] = 3.109, p = .004). This suggests that, relative to nonmusicians, musicians display heightened sensitivity to differences within categories.

Figure 5. The means and standard errors of the between- and within-category discrimination scores and peakedness of the tonal continuum for musicians and nonmusicians.
Categorical Perception of VOT
Identification and Discrimination Curves
Figure 6 shows the identification and discrimination curves of the VOT continuum for musicians and nonmusicians. The identification curves are steep near the boundary position, and the boundary position coincides with the discrimination peak. Together, these features indicate typical categorical perception of the VOT continuum in both groups.

Figure 6. The identification and discrimination curves of the VOT continuum for musicians and nonmusicians. The dotted line represents /pha1/ responses. The dashed line indicates /pa1/ responses. The solid line refers to discrimination curves.
Identification
The means and standard errors of the boundary position and boundary width of the VOT continuum for musicians and nonmusicians are illustrated in Figure 7. The musician group had slightly larger boundary positions and narrower boundary widths. Independent samples t-tests revealed no statistically significant differences for either the boundary position (t[37] = 1.062, p = .295) or the boundary width (t[37] = −1.176, p = .247), suggesting that both groups performed comparably in identifying the stop stimuli.

Figure 7. The means and standard errors of the boundary position and boundary width of the VOT continuum for musicians and nonmusicians.
Discrimination
The means and standard errors of the within- and between-category discrimination scores are depicted in Figure 8. Independent samples t-tests reveal no significant group differences with regard to the between-category discrimination scores (t[37] = 0.907, p = .370) or within-category discrimination scores (t[37] = 0.259, p = .797). This implies that both groups performed comparably in discriminating stimuli both across and within categories.

Figure 8. The means and standard errors of the between- and within-category discrimination scores and peakedness of the VOT continuum for musicians and nonmusicians.
Furthermore, the mean peakedness score for the musician and nonmusician groups stood at 0.203 and 0.225, respectively. An additional independent samples t-test indicated no statistically significant group distinctions (t[37] = 0.743, p = .462), affirming that both groups, irrespective of musical experience, demonstrated a similar extent of categorical perception when perceiving stops.
Influences of Instruments on Categorical Perception
To further investigate the potential effects of playing Chinese or Western instruments on the categorical perception of lexical tones and stops, we divided the musicians into two groups: the Western instruments group (WG) and the Chinese instruments group (CG). Several ANOVAs were conducted to explore whether playing different instruments influenced the perception of tones and stops. The detailed results can be found in Table 2.
Table 2. Statistical Results for Lexical Tones and Stop Consonants.
Regarding lexical tones, the analysis revealed no statistically significant differences among the groups in boundary widths, between-category discrimination scores, or peakedness. These findings suggest that the degree of categorical perception of lexical tones in speakers of tone languages was not affected by the type of instrument played. However, significant group differences were observed for boundary positions and within-category discrimination scores. Post hoc analysis indicated significant differences between CG and nonmusicians, and marginally significant differences between WG and nonmusicians, in terms of boundary positions (p = .06). Moreover, significant differences were identified between the CG and the remaining two groups in terms of within-category discrimination scores (ps < .01). This implies that musicians playing Chinese instruments demonstrated heightened sensitivity to subtle differences in lexical tones compared to those playing Western instruments. No other pairwise differences were found among the groups with respect to boundary positions and within-category discrimination scores.
Regarding stop consonants, our findings indicated the absence of statistically significant distinctions among the three groups in terms of categorical perception of stops (all ps > .05, as shown in Table 2). This aligns with the aforementioned analysis, suggesting that musical training, regardless of the type of instrument, did not alter the categorical perception of lexical tones and stop consonants among native speakers of tone languages.
Discussion
Our findings indicated comparable boundary widths, between-category discrimination scores, and peakedness values between musicians and nonmusicians, indicative of typical categorical perception when processing lexical tones and VOT (Chen & Peng, 2021; Chen et al., 2017), whereas Mandarin-speaking musicians showed enhanced sensitivity to within-category lexical tone stimuli, consistent with prior research (Chen et al., 2020; Sadakata & Sekiyama, 2011; Zhao & Kuhl, 2015b). Their enhanced sensitivity to within-category tone stimuli echoes earlier studies that have repeatedly revealed musicians’ heightened sensitivity to various acoustic cues, notably those pertaining to pitch information, leading to superior pitch processing ability in musicians (Marie et al., 2011, 2012). While previous research has indicated that musicians who speak a tone language are more likely to possess absolute pitch than those who speak a non-tone language (Deutsch et al., 2004), our study’s participants all spoke tone languages, and we did not assess their absolute pitch abilities. Consequently, we cannot definitively conclude that absolute pitch contributed to the observed enhancement in sensitivity to within-category tone stimuli. One possible explanation for the improved sensitivity is that structural and functional changes in the brain boost speech processing. Neuroimaging studies reveal that upstream regions are responsible for acoustic information processing, while downstream areas process high-level phonological information (Okada et al., 2010; Wessinger et al., 2001; Zhang et al., 2011). Musical experience induces structural and functional modifications in the upstream regions responsible for acoustic analysis, rather than affecting the downstream regions engaged in phonological information processing (Elmer et al., 2012; Hyde et al., 2009). This evidence therefore converges to suggest that musical experience enhances auditory sensitivity (Kraus & Chandrasekaran, 2010), but does not alter the internal phonological representations in Mandarin-speaking musicians (Wu et al., 2015).
Nevertheless, the current results stand in stark contrast to research proposing that musicians have sharper boundary slopes and improved sensitivities to both between- and within-category stimuli when perceiving nonnative speech sounds (Chen et al., 2021). That study found that Cantonese-speaking musicians exhibited sharper boundary slopes, increased sensitivities to between-category stimuli, and greater peakedness. This discrepancy likely results from the relative complexity of the tone system in Cantonese. The tone inventory in Cantonese is much denser than that of Mandarin, with six distinct tones in Cantonese and four in Mandarin. Research has shown that tone density affects tone processing in a second language (M. Zhu et al., 2021). Participants who speak a language with a more complex tone system are more accurate and faster in the identification and discrimination of nonnative tones, and they may process pitch variation more efficiently than those whose native language has fewer tones (Y.-S. Lee et al., 1996; M. Zhu et al., 2021).
Although both musical and language experience could heighten speech perception, their facilitatory effects did not accumulate, possibly due to the internal conflict between music systems and language systems and the plastic changes driven by music (Cooper & Wang, 2012; Tang et al., 2016). Driven by language experience, individuals prioritize coarse perceptual features and filter out fine acoustic details. Musical experience might not bear much influence on the phonological representations of those with a native tone language background, especially those whose music training began after the critical period for speech development. Like their nonmusician counterparts, Mandarin-speaking musicians frequently use pitch and VOT to distinguish meanings at the suprasegmental and segmental levels, and the capacity to associate pitch or VOT shifts with semantic changes was established during the initial phases of language acquisition, leading to between-category discrimination scores and boundary widths comparable to those of nonmusicians. Notwithstanding the potential enhancement of categorical perception of tones and VOT in young children through musical training, the current study did not observe a facilitative effect in adults. Twelve months of musical training could enhance the categorical perception of VOT in French-speaking children aged 8 to 10 years, particularly at the preattentive stage (Chobert et al., 2014). Furthermore, children aged 4 to 5 years, as investigated by Yao et al. (2022), demonstrated improved categorical perception of both lexical tones and VOT following 12 months of piano training. The lack of high-level music-to-language transfer is presumably due to the absence of neuroplastic change driven by music in adults (Tang et al., 2016; J. Zhu et al., 2021).
It is therefore plausible that musical experience does not further boost the mapping of acoustic features onto long-term phonological representations, and that adult musicians do not show superior abilities in this mapping.
In addition, the early establishment and maturation of categorical perception make it more resistant to change. Eimas et al. (1971) discovered that categorical perception is established at a surprisingly early developmental stage, and it is continuously refined in the early stages of language development (Ma et al., 2021). Prior research has indicated that Mandarin-speaking children achieve categorical perception of lexical tones comparable to that of adults by the age of 6 (Chen et al., 2017; Feng & Peng, 2022; Ma et al., 2021; Xi et al., 2009) and achieve adult-like categorical perception of VOT at age 10 (Feng & Peng, 2022). Musical abilities might be another factor contributing to the absence of musical benefits on the magnitude of categorical perception of stops and tones. An intriguing avenue for future research would be to examine the impact of musicians’ musical abilities on our results by comparing musicians with different levels of musical proficiency. In Chen et al. (2021), the mean onset age of musical training was 6.96 years, much earlier than the 14.1 years in our study, and their mean length of musical training was much longer than ours (16.80 vs. 6.89 years). In the current study, musicians might have developed robust phonological representations before the onset of musical training, which, in turn, makes these representations more resistant to the influence of musical training (Besson et al., 2011a, 2011b). As a result, around 7 years of musical training after puberty might not change the internal phonological representations. This underscores a limit on the degree to which musical training can augment an already well-established categorical perception among native speakers.
Therefore, musical experience strengthens the acuity of subtle low-level acoustic variations between within-category lexical tone stimuli without interfering with the high-level phonological representations in the native language (Delogu et al., 2010; Wu et al., 2015).
However, it is interesting that musicians and nonmusicians had comparable performance in the categorical perception of VOT, suggesting that musical experience does not modulate the categorical perception of VOT and that musical benefits for lexical tones do not generalize to VOT in adults. This runs contrary to previous findings that musical training improved active and passive perception of VOT in children, indicative of the favorable transfer of musical skills to the phonological level of language processing (Chobert et al., 2014). The discrepancy is possibly due to the difference in music-driven neuroplastic change between adults and children (Tang et al., 2016; J. Zhu et al., 2021). More importantly, the observed asymmetry between the perception of lexical tones and stops is likely due to the different intrinsic features of these speech elements. Auditory short-term memory might not capture the acoustic cues needed to differentiate two distinct stops within the same phonetic category as effectively as it represents such cues for lexical tones. The observed difference could potentially arise from the temporal characteristics of the crucial information: tones, characterized by their F0 contour, tend to have relatively longer durations than stops, particularly their voice onset time. Drawing on the cue-duration hypothesis (Pisoni, 1973), it is plausible that acoustic cues with shorter durations, which have weaker auditory memory representations, give rise to more pronounced categorical perception effects. Consequently, it appears that musical experience cannot further enhance the robustness of categorical perception for stop consonants. This contributes to an uneven categorical perception effect between lexical tones and stop consonants.
Furthermore, our findings suggest that the specific instrument played, whether Chinese or Western, did not significantly affect the overall categorical perception of lexical tones and stop consonants. However, musicians playing Chinese instruments displayed enhanced sensitivity to subtle differences in lexical tones compared to musicians playing Western instruments. These results point to a potential influence of instrument type on categorical perception and provide a starting point for further research in this area. Nevertheless, it is important to acknowledge the limitation stemming from the small sample size of the Western instruments group, which comprised only four participants. As a result, definitive conclusions regarding the impact of playing Western instruments on the categorical perception of lexical tones cannot be drawn, particularly because prior studies have consistently demonstrated that individuals who play Western instruments tend to display improved speech perception abilities (Chen et al., 2020, 2021; Marie et al., 2011). Therefore, we acknowledge the need for further investigation and a cautious interpretation of these results.
Conclusion
Regarding lexical tones, our findings showed no significant group differences in boundary widths, between-category discrimination scores, or peakedness. However, statistically significant differences were observed in within-category discrimination scores and boundary positions. Additionally, no substantial differences were evident between musicians and nonmusicians in their categorical perception of stop consonants. Musicians playing Chinese instruments displayed heightened sensitivities to within-category lexical tone stimuli compared to musicians playing Western instruments, although this observation should be interpreted with caution.
In sum, our findings indicated that musical experience strengthens the acuity of subtle low-level acoustic variations between within-category lexical tone stimuli without interfering with the high-level phonological representations in the native language, while musicians and nonmusicians show a comparable magnitude of categorical perception of lexical tones and stop consonants. Moreover, musical benefits could not readily generalize to stop consonants, indicating that the musical benefit is not general but rather specific to individual speech contrasts. Furthermore, considering that musicians with absolute pitch might exhibit heightened sensitivities, future work should evaluate absolute pitch abilities in order to investigate how and to what extent absolute pitch influences speech perception. This highlights the need to further elucidate the mechanism that governs music-to-language transfer across other linguistic contrasts and among musicians with longer musical experience, varying musical abilities, absolute pitch, and different instruments.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by the Humanities and Social Science Project of Ministry of Education of China (23YJC740043); 2023 Santai Emerging Talent Special Project (23GHQ12).
Ethical Approval
The studies were reviewed and approved by the Ethics Committee of Taizhou University.
Data Availability Statement
The data of this study are available from the corresponding author upon reasonable request.
