Abstract
Emotional prosody refers to the ways in which the tone of voice can be modulated to convey emotions, feelings, and attitudes. Previous studies have explored the perception of emotional prosody and whether native speakers (L1) have an in-group advantage in recognizing the emotional prosody of their own cultural groups over non-native speakers. However, little is known about whether these findings in non-tonal languages can be generalized to tonal languages. Mandarin Chinese uses the tone of voice to encode word meanings in addition to emotional prosody. This study investigates the perception of emotional prosody in Mandarin Chinese using an emotion judgment task, focusing on the effects of emotion type (e.g., neutral, joy, anger, sadness) and syllable length (e.g., monosyllable, disyllable, trisyllable, and sentence). Three groups were included, consisting of 20 native Chinese speakers (native group), 20 L1-English L2-Chinese learners (second language group), and 20 native English speakers without Chinese learning experience (non-native group). The results revealed that all three groups can identify emotional prosody well above the chance level in Mandarin Chinese words and sentences. Moreover, the native group and the second language (L2) group showed an in-group advantage in recognizing emotional prosody compared to the non-native group, highlighting the impact of linguistic experience in addition to cultural backgrounds on the perception of emotional prosody. Notably, the effects of emotion type and syllable length differed across the three groups in terms of their perception of emotional prosody. The native group had difficulty identifying positive emotional prosody, whereas both the L2 group and the non-native group showed a pattern of improved accuracy as syllable length increased, with an interaction effect with emotion type.
Keywords
I Introduction
Emotion, such as joy, sadness, or anger, is a critical aspect of communication in our daily life (Cutler et al., 1997; Wilson and Wharton, 2006). A long-standing debate is whether the perception of emotions is universal or culturally specific (e.g., Brooks et al., 2019; Ekman et al., 1969; Elfenbein and Ambady, 2003; Gendron et al., 2018; Jack et al., 2012; Matsumoto, 1988). In early research, psychologists focused on facial expressions of emotions (e.g., Ekman and Friesen, 1986; Russell, 1994). Russell and Barrett (1999) defined prototypical emotional episodes, which include happiness, sadness, disgust, anger, fear, and surprise. Since then, a growing body of research has begun to investigate differences in emotion recognition in a cross-cultural context. Elfenbein and Ambady (2002a) conducted a meta-analysis of 97 cross-cultural studies, and proposed the in-group advantage (IGA) hypothesis: emotions can be more accurately perceived when expressed by members of one’s own cultural group, while emotions are recognized at a better-than-chance level universally. However, humans communicate emotions not only through facial expressions but also through verbal expressions, such as emotional prosody.
Emotional prosody refers to the ways in which the tone of voice can be modulated to convey emotions, feelings, and attitudes (Kemmerer, 2014). In the field of emotional prosody perception, previous research has been categorized into three types of comparisons (Paulmann and Uskul, 2014). The first type involves listeners from different cultural groups judging emotional prosody expressed by speakers from a single cultural group (e.g., Scherer et al., 2001). The second type involves listeners from a single cultural group judging emotional prosody expressed by speakers from different cultural groups (e.g., Chronaki et al., 2018; Pell et al., 2009). The third type involves listeners from different cultural groups judging emotional prosody expressed by speakers from different cultural groups (e.g., Paulmann and Uskul, 2014). In these studies, the critical manipulation is the cultural backgrounds of both emotional prosody expressors (i.e., speakers) and perceivers (i.e., listeners). When speakers and listeners belong to the same cultural group, the listeners are typically considered native speakers; otherwise, they are non-native speakers.
Previous researchers often utilized an emotion judgment task to compare the emotional prosody perception between native and non-native speakers. They asked voice actors to portray the stimuli with various types of emotional prosody so that the intended emotional prosody type for each stimulus was known to the experimenters (e.g., Beier and Zautra, 1972; Chronaki et al., 2018; Pell et al., 2009; Scherer et al., 2001; Van Bezooijen et al., 1983). In the emotion judgment task, participants listened to the auditory stimuli and judged the intended emotion for each utterance in a forced-choice identification question where a list of predefined response alternatives was given (e.g., neutral, joy, anger, sadness), and their accuracy rates in recognizing emotional prosody were then measured. The results from previous studies revealed that native speakers showed an advantage compared to non-native speakers, although both groups were capable of recognizing emotional prosody (Elfenbein, 2013; Juslin and Laukka, 2003; Laukka and Elfenbein, 2021). These empirical findings extend Elfenbein and Ambady’s (2002a) IGA hypothesis to the study of emotional prosody perception.
Few studies considered whether these findings in non-tonal languages (e.g., English) can be generalized to tonal languages (e.g., Mandarin Chinese). In Mandarin Chinese, the tone of voice can be used to differentiate lexical meaning in addition to encoding emotional prosody (Xu, 2005), and such lexical and prosodic cues coexist (Ip and Cutler, 2020; Ouyang and Kaiser, 2015) and interact (Chang et al., 2023). There are four lexical tone categories in Mandarin Chinese, namely, tone 1: high level, tone 2: rising, tone 3: fall-rising, and tone 4: falling (Yip, 2002). For example,
There have been limited attempts to explore the perception of emotional prosody in Mandarin Chinese within the framework of the IGA hypothesis, and these have yielded inconsistent findings. Some researchers primarily utilized pseudo-words and pseudo-sentences, and they found that native Chinese speakers recognized emotional prosody more accurately than non-native Chinese speakers, supporting the IGA hypothesis (Cowen et al., 2019; Liu and Pell, 2012; Liu et al., 2021; Paulmann and Uskul, 2014). However, when tested with real Chinese words and sentences, Zhu (2013) found that L2 Chinese learners outperformed native Chinese speakers, contradicting the IGA hypothesis. Furthermore, since the lexical tones were not controlled in these studies, it is unclear whether these results can inform the question of whether the IGA hypothesis holds true in tonal languages. Considering the coexistence and interaction of lexical tone and emotional prosody, as well as the scarcity of research on real Chinese words and sentences, it is crucial to examine the perception of emotional prosody using real Chinese words and sentences with controlled lexical tones.
Therefore, the present study utilizes real Chinese words and sentences to investigate the perception of emotional prosody in both native (L1) and non-native Chinese speakers. Specifically, the study examines how emotional prosody is perceived in Mandarin Chinese words and sentences by three groups of speakers: native Chinese speakers (native group), L1-English L2-Chinese learners (L2 group), and native English speakers without Chinese learning experience (non-native group). Furthermore, this study explores the effects of emotion type and syllable length on Chinese emotional prosody perception and further examines whether these effects are the same or different for the three groups. This study has pedagogical implications for L2 Chinese learners and language educators, and it also provides insights into the role of emotional prosody in cross-cultural communication.
II The perception of emotional prosody
1 Effects of emotion type and syllable length
In the literature on emotional prosody perception, researchers have often attributed the observed in-group advantage to the differences between in-group members and out-group members based on their cultural backgrounds. Specifically, when both the expressor and the perceiver share the same cultural background (i.e., in-group members), their accuracy in perceiving emotional prosody is higher. In contrast, when they come from different cultural groups (i.e., out-group members), the accuracy tends to be lower. Previous studies have examined Elfenbein and Ambady’s (2002a) IGA hypothesis in various cultural contexts, and they have also shown that, in addition to cultural backgrounds, factors such as emotion type and stimuli length also have an impact on the perception of emotional prosody (Laukka and Elfenbein, 2021; Laukka et al., 2016).
First, the perception of emotional prosody is influenced by different types of emotions. Cross-cultural comparisons have revealed a negative correlation between the in-group advantage and the accuracy of emotion expressions (Elfenbein and Ambady, 2002b; Juslin and Laukka, 2003). Sauter et al. (2010) found that positive emotions, such as achievement and relief, were not recognized bidirectionally by both English and Himba listeners, whereas negative emotions such as anger and disgust were well recognized across cultures. They ascribed these differences to different social functions of positive and negative emotions: positive emotions facilitate social cohesion within in-group members which may not be shared with out-group members, whereas negative emotions are more closely linked to biological reactions and less affected by cultural learning. Laukka and Elfenbein’s (2021) meta-analysis found that, across different cultures, positive emotional prosody perception showed a greater in-group advantage between native and non-native speakers than negative emotional prosody, despite being recognized less accurately overall. However, it remains unclear how emotion type affects emotional prosody perception in tonal languages, considering that tone of voice can encode both lexical and emotional information.
In addition to the emotion type effect, syllable length also influences emotional prosody perception. Scherer (1986) first proposed that vocal emotion expressions exhibit emotion-specific acoustic patterns across different emotional states. While most studies have limited the scope of acoustic measures to f0 (e.g., Cho and Dewaele, 2021), intensity (e.g., Bachorowski and Owren, 1995), and speech rate (e.g., Koolagudi and Krothapalli, 2011), there is evidence that duration (e.g., syllable length) plays a role in the perception of emotional prosody. Blicher et al. (1990) indicated that the increase in syllable length can enhance the detectability of the tone of voice in Mandarin Chinese. Furthermore, Pell and Kotz (2011) constructed auditory ‘gates’ by increasing the number of syllables, examining how much vocal information is needed for native English listeners to recognize basic emotions (e.g., neutral, happiness, anger, sadness). Their results showed that participants’ accuracy improved as the syllable length increased: 12.6% accuracy rate at the first syllable and 87.1% at the seventh syllable. They also found an interaction between emotion type and syllable length in the perception of emotional prosody in English. For example, shorter utterances showed higher accuracy in recognizing specific emotional prosodies (e.g., sadness and neutral), while longer utterances resulted in better recognition of positive emotions (e.g., happiness). In the studies of Chinese emotional prosody perception, previous researchers either used sentences (Liu and Pell, 2012; Paulmann and Uskul, 2014) or did not control for the syllable length (Lin et al., 2020; Zhu, 2013).
Thus, it is unknown whether emotional prosody in Chinese words and sentences can be perceived by both native and non-native Chinese speakers, to what extent syllable length influences emotional prosody perception, and whether this effect of syllable length interacts with emotion type.
2 Second language experience effect
The existing studies have mainly compared the perception of emotional prosody between native speakers and non-native speakers without L2 learning experience, focusing on cultural backgrounds while ignoring the potential effects of linguistic experience. This leads to further questions regarding the role of linguistic experience in emotional prosody perception and, more specifically, how non-native speakers with language learning experience (i.e., L2 learners) perceive emotional prosody in Mandarin Chinese. In the field of second language acquisition (SLA), three main inquiries have been raised regarding the perception of emotional prosody within the framework of the IGA hypothesis.
The first question is to what extent second language learners can perceive emotional prosody in their second language. Alm and Llorà (2006) found that L1-Swedish L2-English learners and L1-Spanish L2-English learners can distinguish different emotional prosodies in L2 English even in a one-word utterance. Wei et al. (2022) showed that L1-Chinese L2-German learners can recognize emotional prosody in disyllabic German words with an above-chance-level accuracy. In addition, multiple studies indicated that L2 learners can recognize emotional prosody in sentences (e.g., Altrov, 2013; Bhatara et al., 2016; Dromey et al., 2005; Zhu, 2013). The results demonstrated that L2 learners are capable of accurately recognizing emotional prosody in their L2 at both word and sentence levels.
Building upon the findings of L2 learners’ abilities to perceive emotional prosody, the second question is whether native speakers maintain an in-group advantage compared to L2 learners. Some researchers believe that native speakers have an in-group advantage of emotional prosody perception compared to L2 learners. For example, Altrov (2013) found that native Estonian speakers can recognize Estonian emotional prosody better than L1-Russian L2-Estonian learners. Similarly, Graham et al. (2001) showed native English speakers outperformed both L1-Japanese L2-English learners and L1-Spanish L2-English learners in perceiving English emotional prosody. Other researchers claim there are no significant differences between native speakers and L2 learners in terms of emotional prosody perception. For example, Dromey et al. (2005) reported no differences in English emotional prosody perception between native English speakers and L2 English learners. Min and Schirmer (2011) also found that the performance of emotional prosody perception was comparable between native English and L1-Chinese L2-English speakers. However, the most surprising results were found in a tonal language, where L2 Chinese learners outperformed native Chinese speakers in Chinese emotional prosody perception. Zhu (2013) observed that L1-Dutch L2-Chinese learners recognized emotional prosody in Mandarin Chinese more accurately than native Chinese speakers, and native Dutch speakers without L2 Chinese learning experience recognized emotional prosody in Mandarin Chinese as well as native Chinese speakers. Zhu further interpreted these unexpected findings in Chinese emotional prosody perception in light of the differences in the mechanisms of processing tone of voice between native and non-native Chinese speakers. 
Specifically, as speakers of tonal languages, native Chinese speakers tend to prioritize the linguistic function of the tone of voice (e.g., lexical tone) over its paralinguistic role (e.g., emotional prosody), resulting in less accurate recognition of paralinguistic cues (e.g., emotional prosody) compared to speakers of non-tonal languages. In addition, such perception differences between native speakers and L2 learners have been found to interact with emotion type in the perception of emotional prosody. For example, Paone and Frontera (2019) found that native Italian speakers showed comparable performance with L1-Russian L2-Italian speakers in identifying negative emotional prosodies such as anger and sadness, while native Italian speakers showed an in-group advantage at recognizing positive emotional prosody such as joy compared to L2 learners.
Given the inconsistent findings in the field of SLA, the third question is to determine whether second language experience facilitates or interferes with the perception of emotional prosody. Comparing emotional prosody perception among non-native speakers with different levels of L2 learning experience, some studies have revealed that the L2 learning experience can contribute to L2 learners’ perception of emotional prosody in their second languages. For example, Zhu (2013) found that native Dutch speakers with L2 Chinese learning experience outperformed those without L2 Chinese learning experience in the perception of Chinese emotional prosody. Similarly, Shochi et al. (2016) found that native French speakers with more L2 Japanese learning experience were able to recognize Japanese emotional prosody more accurately compared to those with less L2 Japanese learning experience. On the contrary, other researchers have argued that one’s second language experience may interfere with their perception of emotional prosody. For instance, Bhatara et al. (2016) found that L1-French L2-English learners with higher English proficiency were less accurate in recognizing positive emotional prosody in English compared to those with lower English proficiency. They argued that the interference effect may arise from semantics, as L2 learners with higher English proficiency may have focused more on the lexical meaning of the sentence rather than its emotional prosody, compared to those with lower English proficiency.
While previous literature demonstrates that L2 learners are capable of perceiving emotional prosody in their L2, there is still no consensus on whether native speakers have an in-group advantage over L2 learners and whether the L2 learning experience enhances or hinders L2 learners’ emotional prosody perception. Additionally, it is important to note that not only is the systematic teaching of emotional prosody vastly neglected in L2 classrooms (Lengeris, 2012), but the available curriculum and study materials (which are predominantly emotion-neutral) also do not teach L2 learners to perceive emotional prosody in their L2 (Dewaele, 2005; Kaneko and Yamane, 2022). Therefore, further investigation into emotional prosody perception in SLA is crucial to address these inadequacies and will have pedagogical implications for both L2 learners and language educators.
3 Some methodological limitations in previous research
The field of SLA has witnessed a growing body of research that investigates how L2 learners perceive paralinguistic information, such as emotional prosody. However, a closer examination of previous studies reveals some limitations in the experimental design. One noticeable limitation is the lack of inclusion of all three groups of speakers: native speakers (native group), non-native speakers with L2 learning experience (L2 group), and non-native speakers without L2 learning experience (non-native group). Most studies have limited their comparisons to two of the three groups of speakers: native group vs. L2 group (e.g., Altrov, 2013); native group vs. non-native group (e.g., Paulmann and Uskul, 2014); L2 group vs. non-native group (e.g., Shochi et al., 2016). Including all three groups of speakers would elucidate the effects of cultural backgrounds and linguistic experiences on the perception of emotional prosody.
Furthermore, there has been a scarcity of studies investigating emotional prosody perception in tonal languages. In Mandarin Chinese, Zhu (2013) examined emotional prosody perception among three groups of speakers (i.e., native, L2, and non-native groups), yielding the unexpected results that L1-Dutch L2-Chinese learners showed higher accuracy in recognizing Chinese emotional prosody than native Chinese speakers. However, Zhu’s experimental design is problematic in three ways. First, Zhu did not consider the lexical tone effect on emotional prosody perception. She not only used phrases such as
To sum up, in the field of SLA, several studies have explored the perception of emotional prosody between native speakers and L2 learners using Elfenbein and Ambady’s (2002a) IGA hypothesis as a framework. While L2 learners have been shown to be capable of recognizing emotional prosody, the findings from previous research have been inconsistent (Altrov, 2013; Dromey et al., 2005; Graham et al., 2001; Min and Schirmer, 2011; Zhu, 2013). Moreover, one previous study compared emotional prosody perception among native, L2, and non-native groups in a tonal language (Zhu, 2013), but it did not control for confounding factors (e.g., lexical tone, syllable length, and semantic valence). Hence, the present study investigates how L2 Chinese learners perceive emotional prosody in Chinese words and sentences, and whether having L2 Chinese learning experience improves non-native speakers’ perception of emotional prosody in Mandarin Chinese.
III The current study
In light of previous research, the current study extends the psycholinguistic account of emotional prosody perception to the field of SLA specifically in a tonal language. This study investigates how native Chinese speakers (native group), L1-English L2-Chinese learners (L2 group), and native English speakers without Chinese learning experience (non-native group) perceive emotional prosody in Mandarin Chinese words and sentences within the framework of the IGA hypothesis (Elfenbein and Ambady, 2002a). Furthermore, the study explores the effects of emotion type (neutral, joy, anger, and sadness) and syllable length (monosyllable, disyllable, trisyllable, and sentence) on emotional prosody perception in Mandarin Chinese. Therefore, the present study addresses the following research questions:
Research question 1: Does the In-Group Advantage (IGA) hypothesis hold true in Mandarin Chinese words and sentences?
a. Does the native group show an advantage in recognizing emotional prosody in Mandarin Chinese over the non-native group?
b. Does the native group show an advantage in recognizing emotional prosody in Mandarin Chinese over the L2 group?
c. Does the L2 group show an advantage in recognizing emotional prosody in Mandarin Chinese over the non-native group?
Research question 2: To what extent do emotion type and syllable length affect emotional prosody perception in Mandarin Chinese among the three groups?
We made the following predictions for each research question. First, if the IGA hypothesis holds true in Mandarin Chinese (Elfenbein and Ambady, 2002a), we predict an effect of group such that the native group would have an advantage in recognizing emotional prosody over the non-native group. However, considering the inconsistent findings in previous studies, it remains unclear whether the native group would maintain an advantage in recognizing emotional prosody compared to the L2 group (Paulmann and Uskul, 2014; Zhu, 2013), and whether the L2 group would have an advantage in recognizing emotional prosody over the non-native group (Bhatara et al., 2016; Shochi et al., 2016; Zhu, 2013). Moreover, we predict an effect of emotion type such that negative emotional prosody will be perceived more accurately compared to positive emotional prosody in Mandarin Chinese (Laukka and Elfenbein, 2021; Sauter et al., 2010). We also anticipate both an effect of syllable length such that the accuracy of emotional prosody perception improves as the syllable length increases, and an interaction between syllable length and emotion type in the perception of Chinese emotional prosody (Pell and Kotz, 2011). Additionally, we anticipate an interaction between emotion type and group (Bhatara et al., 2016; Paone and Frontera, 2019) in Mandarin Chinese words and sentences.
IV Methods
1 Participants
Based on a closely related study (Zhu, 2013), a total of 60 participants were included in the analysis: 20 native Chinese speakers (native group: 10 male, 10 female; mean age = 24.7; SD of age = 2.45; age range = 22–30), 20 L1-English L2-Chinese learners (L2 group: 7 male, 13 female; mean age = 19; SD of age = 0.65; age range = 18–22), and 20 native English speakers without Chinese learning experience (non-native group: 4 male, 16 female; mean age = 21.1; SD of age = 1.74; age range = 19–24). 1 At the time of their participation, all native Chinese speakers were in China and indicated Mandarin Chinese as their native language. All native English speakers were in the United States and indicated English as their native language. All L2 Chinese learners were enrolled in their second semester of a Mandarin course (mean L2 Chinese learning experience = 6.8 months) 2 at a public US university, and no L2 Chinese learners were heritage speakers of Mandarin Chinese or any other tonal language. All participants had normal hearing. All participants were tested remotely online and received class credit or $10 for their participation. All aspects of the study were approved by the Institutional Review Board (IRB) of the first author’s university.
2 Stimuli
We adapted materials from Shen (1985) to create the word and sentence stimuli with controlled lexical tones and neutral semantic valence. 3 We selected words and sentences that were not typically found in L2 learners’ textbooks to minimize the influence of semantic knowledge on their judgments of emotional prosody. Given the prevalence of relatively simple syllable structures, with only approximately 400 distinct syllables in Chinese (Duanmu, 2007), such construction of the stimuli allows for a comparison among the three groups: the native group (familiar with both phonology and semantics), the L2 group (familiar with phonology but not semantics), and the non-native group (unfamiliar with both semantics and phonology).
Furthermore, based on previous research (Paulmann and Uskul, 2014; Zhu, 2013), we manipulated the syllable length and emotion type of the stimuli, ensuring a similar distribution of the four lexical tone categories across different syllable lengths and emotion types. Specifically, to explore the effect of syllable length, we included monosyllables, disyllables, trisyllables, and sentences. To probe into the effect of emotion type, we asked a professional voice actress to record all the words and sentences in four types of emotional prosody: neutral, joy, anger, and sadness. After collecting the sound files, we used Praat (Boersma and Weenink, 2023) to segment the recorded utterances. In addition, we asked six native Chinese speakers to validate these recorded utterances by classifying the emotional prosody of each utterance in a four-alternative forced-choice format, and we only used the utterances that received unanimous agreement in the current experiment (144 out of 288 utterances). Thus, there were 144 stimuli (i.e., 16 monosyllabic words, 64 disyllabic words, 48 trisyllabic words, and 16 sentences) in the emotion judgment task. Table 1 provides examples of these stimuli, and Table 2 presents the acoustic parameters of the stimuli across syllable lengths and emotion types.
Table 1. Example stimuli of Chinese words and sentences.
Table 2. The means and standard deviations (in parentheses) of three acoustic parameters of the stimuli.
In the emotion judgment task, the stimuli were presented in four blocks: monosyllable block, disyllable block, trisyllable block, and sentence block. A cross-block Latin square design was used to counterbalance the presentation order of blocks, and thus four versions of the emotion judgment task were created in Qualtrics. Furthermore, within each block, the order of stimuli with different emotion types was also counterbalanced using a Latin square design. Additionally, six filler utterances were used in the experiment to check participants’ attention.
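The counterbalancing of block order can be illustrated with a short sketch. The study does not specify which Latin square was used, so the cyclic construction below is an assumption, as are the names `BLOCKS`, `cyclic_latin_square`, and `versions`; it only shows how four block orders can be arranged so that each block occupies each serial position exactly once.

```python
# Illustrative sketch only: a cyclic Latin square is ASSUMED here;
# the study does not report which Latin square was used.
BLOCKS = ["monosyllable", "disyllable", "trisyllable", "sentence"]


def cyclic_latin_square(items):
    """Return len(items) orderings in which every item appears
    exactly once in each row and once in each serial position."""
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]


# One row per (hypothetical) version of the task.
versions = cyclic_latin_square(BLOCKS)

for number, order in enumerate(versions, start=1):
    print(f"Version {number}: {' -> '.join(order)}")
```

A cyclic square balances serial position but not which block immediately precedes which; a design that also balances first-order carry-over would need a different construction.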
3 Procedure
In this study, participants first completed the language background questionnaire and were then randomly assigned to one version of the online emotion judgment task, administered through Qualtrics in their respective native languages. In the language background questionnaire, participants were asked to provide information about their native languages and L2 Chinese learning experience (if any). The emotion judgment task was self-paced, and the participants were instructed to listen to a series of utterances, one at a time, and then judge the intended emotional prosody for each utterance in a four-alternative forced-choice format (i.e., neutral, joy, anger, and sadness). Participants’ responses from the language background questionnaires and emotion judgment tasks were recorded. After the emotion judgment task, we asked the L1-English L2-Chinese learners to report whether they knew the meanings of the target words and sentences used in the experiment. The post-experiment reports showed that L2 learners had only limited knowledge of the semantics of the stimuli, suggesting that semantics had little influence on emotional prosody perception for the L2 group.
4 Analysis
In the emotion judgment task, the total number of trials in data analysis was 8,640 (2,880 from the native group, 2,880 from the L2 group, and 2,880 from the non-native group). For each trial, participants’ judgments of emotional prosody were recorded and collected using Qualtrics. Participants received a score of ‘1’ if they recognized the emotional prosody correctly, as their judgment matched the intended emotional prosody of the utterance; they received a score of ‘0’ if their judgment mismatched the intended emotional prosody of the utterance. The raw scores (coded as 1 and 0) were averaged across the participants to calculate their accuracy rate.
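The trial counts and the scoring rule described above can be sketched as follows. The group size, number of stimuli, and 1/0 scoring come from the text; the function names and the toy responses are hypothetical and serve only as illustration, not as the authors' analysis script.

```python
# Figures taken from the text: 20 participants per group, 144 stimuli, 3 groups.
N_PARTICIPANTS_PER_GROUP = 20
N_STIMULI = 144
GROUPS = ["native", "L2", "non-native"]

trials_per_group = N_PARTICIPANTS_PER_GROUP * N_STIMULI
total_trials = trials_per_group * len(GROUPS)
print(trials_per_group, total_trials)


def score_trial(response, intended):
    """Score 1 if the judgment matches the intended emotion, else 0."""
    return 1 if response == intended else 0


def accuracy_rate(trials):
    """Mean of the 1/0 scores over (response, intended) pairs."""
    scores = [score_trial(response, intended) for response, intended in trials]
    return sum(scores) / len(scores)


# Hypothetical toy data (not from the study): three of four judgments correct.
toy_trials = [
    ("joy", "joy"),
    ("neutral", "joy"),
    ("anger", "anger"),
    ("sadness", "sadness"),
]
print(accuracy_rate(toy_trials))
```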
Moreover, to test the IGA hypothesis in Mandarin Chinese, a logistic mixed-effects model (Jaeger, 2008) was conducted using the
V Results
1 Descriptive statistical results
In Figure 1, the confusion matrices show that both the native and L2 groups had higher overall accuracy rates than the non-native group (native group: 94.7%; L2 group: 95.9%; non-native group: 78.7%). Although lower, the non-native group’s accuracy rate was still well above the chance level. The native and L2 groups showed higher accuracy rates than the non-native group across all four emotion types.

Figure 1. Confusion matrices and mean accuracy rates (%) of emotional prosody judgments in three groups.
A notable observation in Figure 1 is that, in the ‘joy’ condition, the native group showed a lower accuracy rate compared to the L2 group (native group: 89%; L2 group: 95%). Further analysis of error patterns revealed that native Chinese speakers had a higher tendency to mistake the emotion of ‘joy’ for ‘neutral’, with 38.8% (59 errors out of a total of 152 errors) of their errors involving this specific misjudgment. In contrast, this misjudgment accounted for only 10.1% and 21.2% of the errors made by the L2 group and the non-native group, respectively, indicating that these groups were less likely to confuse ‘joy’ with ‘neutral’.
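The error share reported for the native group can be reproduced from the counts given in the text. The sketch below uses only figures stated in this article (59 ‘joy’-as-‘neutral’ errors, 152 total errors, 2,880 native-group trials); the variable names are illustrative.

```python
# Counts reported in the text for the native group.
native_trials = 2880          # 20 participants x 144 stimuli
native_errors = 152           # total misjudgments
joy_as_neutral_errors = 59    # 'joy' stimuli judged as 'neutral'

# Share of all native-group errors that were 'joy' -> 'neutral' confusions.
share = joy_as_neutral_errors / native_errors

# Overall accuracy implied by the error count.
overall_accuracy = 1 - native_errors / native_trials

print(f"joy-as-neutral share of errors: {share:.1%}")      # 38.8%
print(f"implied overall accuracy: {overall_accuracy:.1%}")  # 94.7%
```

The implied overall accuracy agrees with the 94.7% reported for the native group in Figure 1.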
Moreover, Figure 2 illustrates the mean accuracy rates for three groups (native group, L2 group, and non-native group) across four emotion types (neutral, joy, anger, and sadness) and four syllable lengths (monosyllable, disyllable, trisyllable, and sentence). The native and L2 groups consistently outperformed the non-native group in all four emotion types and syllable lengths.

Figure 2. Mean accuracy rates of emotional prosody judgments across four emotion types and syllable lengths in three groups. The black vertical lines show the standard error.
Interestingly, as shown in Figure 2, in the ‘joy’ condition, the native group showed a lower accuracy rate compared to the L2 group, particularly in the ‘monosyllable’ condition (native group: 71.3%; L2 group: 96.3%). However, for the other three emotion types (i.e., neutral, anger, and sadness), the native and L2 groups had comparable accuracy rates.
In addition, Figure 3 shows the interaction between emotion type and syllable length on the accuracy of emotional prosody perception for the three groups. In each group, the mean accuracy rates for disyllables, trisyllables, and sentences were higher than that for monosyllables. Averaged across the three participant groups, accuracy was 80.7% in the monosyllable condition, and 89.7%, 91.6%, and 93.5% in the disyllable, trisyllable, and sentence conditions, respectively.
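The size of the syllable-length effect is easier to read as a gain over the monosyllable baseline. The short sketch below uses only the four means reported above; the variable names are ours.

```python
# Mean accuracy (%) by syllable length, averaged over the three groups,
# as reported in the text.
mean_accuracy = {
    "monosyllable": 80.7,
    "disyllable": 89.7,
    "trisyllable": 91.6,
    "sentence": 93.5,
}

baseline = mean_accuracy["monosyllable"]

# Gain over the monosyllable condition, in percentage points.
gains = {
    length: round(accuracy - baseline, 1)
    for length, accuracy in mean_accuracy.items()
    if length != "monosyllable"
}
print(gains)  # {'disyllable': 9.0, 'trisyllable': 10.9, 'sentence': 12.8}
```

On these descriptive means, most of the gain occurs between monosyllables and disyllables (9.0 percentage points), with a further 1.9 points at each later step.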

Figure 3. Plots of interaction between emotion type and syllable length for three groups in terms of mean accuracy rate.
As shown in Figure 3, the accuracy rate in the monosyllable condition (represented by the light blue line) shows larger fluctuations than the other syllable length conditions for all three groups. While all three groups had their lowest accuracy rate in the monosyllable condition, the emotion type associated with this lowest accuracy varied across the three groups: the native group had the lowest accuracy in the ‘joy’ condition, the L2 group showed the lowest accuracy in the ‘neutral’ and ‘sadness’ conditions, and the non-native group exhibited the lowest accuracy in the ‘neutral’ condition.
2 Inferential statistical results
As shown in Table 3, there was an effect of group: both the native group (β = 0.660,
Table 3. Mixed-effects logistic regression model for the accuracy of emotional prosody perception in three groups: native group, L2 group, and non-native group.
To address research question 1 of whether the IGA hypothesis holds in Mandarin Chinese, we used Tukey’s test from the
Table 4. Post-hoc analysis results comparing the mean accuracy for three participant groups.
Furthermore, the significant effects and interactions in the omnibus model warranted separate analyses for each group to address research question 2, namely, to what extent emotion type and syllable length affect emotional prosody perception. As shown in Table 5, for the native group, the emotional prosody of ‘joy’ was recognized less accurately (β = −1.255,
Table 5. Mixed-effects logistic regression model for the accuracy of emotional prosody perception in the native group.
Table 6. Pairwise comparisons for the accuracy of emotional prosody perception in the native group.
For the L2 group, as shown in Table 7, emotional prosody in ‘monosyllable’ was recognized less accurately compared to the grand mean (β = −0.897,
Table 7. Mixed-effects logistic regression model for the accuracy of emotional prosody perception in the second language (L2) group.
Table 8. Pairwise comparisons for the accuracy of emotional prosody perception in the second language (L2) group.
For the non-native group, as shown in Table 9, the emotional prosody of ‘joy’ (β = −0.519,
Table 9. Mixed-effects logistic regression model for the accuracy of emotional prosody perception in the non-native group.
Table 10. Pairwise comparisons for the accuracy of emotional prosody perception in the non-native group.
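The β values quoted in this section come from logistic regression models and are therefore on the log-odds scale. As a reading aid, the sketch below converts the four coefficients that appear in the text into odds ratios via exp(β). What each ratio is relative to (a reference level or the grand mean) depends on the contrast coding of the respective model, which we do not reconstruct here; the dictionary labels and the function name `odds_ratio` are ours.

```python
import math

# Log-odds coefficients quoted in the text. The comparison each one
# encodes depends on the contrast coding of the corresponding model.
betas = {
    "native group (omnibus model)": 0.660,
    "joy (native-group model)": -1.255,
    "monosyllable (L2-group model)": -0.897,
    "joy (non-native-group model)": -0.519,
}


def odds_ratio(beta: float) -> float:
    """Convert a logistic regression coefficient to an odds ratio."""
    return math.exp(beta)


odds_ratios = {label: odds_ratio(beta) for label, beta in betas.items()}
for label, ratio in odds_ratios.items():
    print(f"{label}: odds ratio = {ratio:.2f}")
```

An odds ratio above 1 indicates higher odds of a correct judgment relative to the comparison implied by the coding, and a ratio below 1 indicates lower odds. For example, β = −1.255 corresponds to odds of a correct response of roughly 0.29 times the comparison value.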
VI Discussion
In this study, we investigated emotional prosody perception in Mandarin Chinese for three groups of speakers (native group, L2 group, and non-native group) across four emotion types (neutral, joy, anger, sadness) and four syllable lengths (monosyllable, disyllable, trisyllable, and sentence) using an emotion judgment task within the framework of the in-group advantage (IGA) hypothesis (Elfenbein and Ambady, 2002a). The study contributes to the existing literature on emotional prosody perception in tonal languages by using real Chinese words and sentences as stimuli while manipulating emotion type and syllable length. Furthermore, our study extends the psycholinguistic account of emotional prosody perception to the field of second language acquisition, providing insights into how L2 learners perceive paralinguistic information, such as emotional prosody, in their second language.
Overall, our study indicated that native Chinese speakers (native group) and L1-English L2-Chinese learners (L2 group) recognized emotional prosody in Mandarin Chinese at a very high accuracy rate (native group: 94.7%; L2 group: 95.9%). In contrast, native English speakers without Chinese learning experience (non-native group) recognized emotional prosody in Mandarin Chinese less accurately (non-native group: 78.7%) but still well above the chance level. These results showed that although the non-native group demonstrated the ability to perceive emotional prosody in an unfamiliar tonal language at both the word and sentence levels, the native group had an in-group advantage in recognizing emotional prosody in Mandarin Chinese words and sentences compared to the non-native group. The findings provide support for Elfenbein and Ambady’s (2002a) IGA hypothesis in the context of tonal languages.
In addition to the in-group advantage demonstrated by the native group, we also found that the L2 group, who had only a short period of Chinese language learning, recognized emotional prosody in Mandarin Chinese words and sentences more accurately than the non-native group, even though both groups belonged to the same cultural group (i.e., native English speakers). This finding can be explained by the phonological familiarity gained through L2 Chinese learners’ linguistic experience. Notably, Mandarin Chinese features a relatively small set of distinct syllables (approximately 400) and just over 1,300 unique syllable-tone combinations (Duanmu, 2007). Therefore, despite their limited experience with the Chinese language, L2 Chinese learners may have already gained a certain degree of phonological familiarity with many syllables and syllable-tone combinations. This phonological familiarity has been shown to improve linguistic processing for L2 learners (e.g., Kaushanskaya et al., 2013; Liu and Wiener, 2020). Our findings suggest that this facilitation effect of linguistic experience can be extended to paralinguistic processing, thereby potentially compensating for L2 learners’ disadvantage of not being native speakers in their perception of emotional prosody.
Taken together, our study revealed that individuals with linguistic experience, including both the native group and the L2 group, outperformed those without such experience (the non-native group) in the perception of emotional prosody, in line with previous findings (Paulmann and Uskul, 2014; Zhu, 2013). However, Elfenbein and Ambady’s (2002a) IGA hypothesis only predicts an in-group advantage based on cultural background, whereby native speakers (culturally in-group members) have an advantage over non-native speakers (culturally out-group members); it does not explicitly address what emotional prosody perception looks like for L2 learners. This raises an important question within the framework of the IGA hypothesis: how should we define and measure ‘in-groupness’ when including L2 learners in studies? While cultural background is indeed a contributing factor to the in-group advantage, it is not necessarily the only one. Our results showed that native English speakers with L2 Chinese learning experience perceived Chinese emotional prosody significantly better than those without L2 Chinese learning experience. This finding highlights the pivotal role of second language experience in shaping emotional prosody perception, alongside cultural background. Therefore, we suggest that future research consider both cultural background and language experience when investigating emotional prosody perception involving non-native learners.
Interestingly, our study found that L2 Chinese learners performed comparably to native Chinese speakers in perceiving emotional prosody in Mandarin Chinese words and sentences, consistent with prior studies (Dromey et al., 2005; Min and Schirmer, 2011). We also found that L2 Chinese learners recognized positive emotional prosody (i.e., joy) more accurately than native Chinese speakers, particularly in monosyllabic words (native group: 71.3%; L2 group: 96.3%). This finding aligns with previous work (Zhu, 2013) and indicates an interaction between emotion type and linguistic experience in a tonal language. Our results can be explained by the precedence of tone of voice as a linguistic cue over paralinguistic cues among tonal language speakers (Zhu, 2013), coupled with an asymmetric perception of emotional prosody (Laukka and Elfenbein, 2021). Neural studies have indicated that native Chinese speakers, as tonal language speakers, exhibit greater sensitivity to task-irrelevant linguistic cues (Liu et al., 2015) and experience more interference from lexical tones in speech perception (Yu and Zhang, 2018) than non-tonal language speakers. Meanwhile, prior studies have revealed a notable asymmetry whereby negative emotional prosody is generally more readily identified than positive emotional prosody (Laukka and Elfenbein, 2021; Liu and Pell, 2012). Negative emotional prosody often serves as a warning signal and may thus have evolved to be more distinct and recognizable (Sauter et al., 2010), whereas positive emotional prosody is usually perceived across multiple channels alongside contextual meanings or facial expressions (Chang et al., 2023; Pell et al., 2009). Therefore, in our study, native Chinese speakers, as tonal language speakers, may have been more susceptible to task-irrelevant linguistic cues (i.e., lexical tones), receiving more interference from lexical tones in perceiving positive emotional prosody than L2 Chinese learners.
However, due to the innate salience of negative emotional prosody, native Chinese speakers may experience minimal interference from lexical tones, resulting in high accuracy in identifying negative emotional prosody (mean accuracy of anger and sadness = 96%), similar to L2 Chinese learners.
Another possible explanation is the influence of semantics on the perception of emotional prosody. Recent research shows that the semantic valence of the stimuli and the semantic knowledge of the participants can affect emotional prosody perception for both native and L2 speakers. For example, Cho and Dewaele (2021) indicated that semantic valence facilitated the perception of English emotional prosody for native and L2 English speakers in an emotion-congruent condition. Bhatara et al. (2016) found that participants’ semantic knowledge interfered with emotional prosody perception for L2 English learners. Ben-David et al. (2016) discovered that semantics had an impact on the perception of emotional prosody for native English speakers, even when it was task-irrelevant. Importantly, recent studies in Mandarin Chinese found a semantic-prosody congruency effect on the perception of Chinese emotional prosody for native and L2 Chinese speakers (Lin et al., 2020; Xiao and Liu, 2025). In our study, although we controlled the semantic valence of the stimuli, L2 Chinese learners had limited semantic knowledge of the stimuli, whereas native Chinese speakers knew their meanings. This semantic knowledge may have interfered with emotional prosody perception, manifesting in specific error patterns: native Chinese speakers were more likely to mistake ‘joy’ for ‘neutral’, a confusion that accounted for 38.8% of their errors and contributed to significantly lower accuracy in their perception of positive emotional prosody; in contrast, the same confusion accounted for only 10.1% of L2 Chinese learners’ errors. Our findings suggest that native Chinese speakers were more biased by semantics and thus confused the emotional prosody of ‘joy’ with ‘neutral’, especially when the encoded prosodic information was subtle and limited (e.g., ‘joy’ in monosyllables).
Conversely, L2 Chinese learners had limited semantic knowledge of Chinese words and sentences and thus experienced less semantic interference in their perception of emotional prosody. As a result, L2 learners may have focused more on prosodic cues rather than semantic cues in stimuli, enabling them to recognize positive emotional prosody better than native speakers.
Just as there is an effect of emotion type, syllable length also has an impact on emotional prosody perception in Mandarin Chinese. Few studies have specifically investigated the effect of syllable length on emotional prosody perception in tonal languages. In this study, we found that native Chinese speakers, L1-English L2-Chinese learners, and native English speakers without Chinese learning experience could perceive emotional prosody above the chance level in Chinese words and sentences, and that the recognition of emotional prosody improved as syllable length increased for all three groups. Furthermore, there were group differences in the perception of emotional prosody: the native group showed the lowest accuracy in recognizing ‘joy’ in the monosyllable condition, while for the L2 and non-native groups, the lowest accuracy was associated with recognizing the ‘neutral’ emotion in the monosyllable condition. Evidence from cognitive neuroscience has shown that the brain state in ‘neutral’ serves as a central hub within the network of emotions (Kragel et al., 2022). Thus, we speculate that this central role of recognizing neutral emotions could present a greater challenge for non-native speakers (both L2 and non-native groups) in establishing a baseline of emotion perception in an unfamiliar or second language, especially when emotional information is limited (e.g., monosyllables).
The current study was not without limitations. First, the experiment was conducted remotely. In future studies, it is essential to control the acoustic environment during emotion judgment tasks, since a remote setup may result in diverse listening environments (Yan et al., 2022). Second, previous studies have reported significant gender and age effects on emotional prosody perception (e.g., Hunter et al., 2010; Lin et al., 2021a, 2021b; Sen et al., 2018). Although our results showed no significant effects of gender and age across groups, the imbalance of gender representation in the non-native group raises a potential concern. Future studies can manipulate the gender and age factors, exploring potential interactions with linguistic experience in the perception of emotional prosody. Furthermore, the current study found that native Chinese speakers with semantic knowledge of the stimuli showed lower accuracy in the perception of positive emotional prosody than the L2 Chinese learners without such semantic knowledge. It would be interesting to examine how semantics influences emotional prosody perception when both native and L2 speakers have semantic knowledge of the stimuli. In addition, our study provided evidence that second language learning experience can aid paralinguistic processing for non-native speakers in a tonal language. To elucidate the scope and mechanisms of this facilitation effect, it is necessary to examine the perception of emotional prosody in L2 learners at different stages of language proficiency, including elementary, intermediate, and advanced levels. Such investigations have pedagogical implications for L2 education, where emotional cues are often ignored in language learning and language teaching.
VII Conclusions
The present study examined emotional prosody perception in Mandarin Chinese words and sentences for three groups: the native group, the L2 group, and the non-native group. The results showed that native Chinese speakers had an advantage in recognizing emotional prosody in Mandarin Chinese compared to native English speakers without Chinese learning experience, which supports the IGA hypothesis in a tonal language. L1-English L2-Chinese learners also recognized Chinese emotional prosody more accurately than native English speakers without Chinese learning experience, indicating that linguistic learning experience plays a significant role in emotional prosody perception. Interestingly, our study also revealed an interaction between emotion type and language experience: L2 Chinese learners outperformed native Chinese speakers in the perception of positive emotional prosody. We argue that the emotional prosody perception of native Chinese speakers was more biased by linguistic cues (such as lexical tone and semantics) than that of L2 Chinese learners.
Furthermore, we found that emotion type and syllable length affect the perception of emotional prosody in Mandarin Chinese. Negative emotional prosody was perceived more accurately than positive emotional prosody in Chinese words and sentences. Although all three groups demonstrated the ability to perceive emotional prosody in a single syllable (monosyllables), the accuracy of emotional prosody perception increased with syllable length. Additionally, there was an interaction between emotion type and syllable length on the perception of emotional prosody: native Chinese speakers exhibited the lowest accuracy in identifying positive emotional prosody such as ‘joy’ in monosyllabic stimuli, whereas native English speakers, in both the L2 and non-native groups, had the lowest accuracy in recognizing ‘neutral’ prosody in monosyllables. In summary, this study sheds light on the complex nature of emotional prosody perception in tonal languages and highlights the effects and interactions of speaker group, emotion type, and syllable length on emotional prosody perception. Future research should consider both cultural background and linguistic experience when studying emotional prosody perception in the context of second language acquisition.
Footnotes
Acknowledgements
We would like to thank Charles B. Chang and the three anonymous reviewers for their constructive feedback on our manuscript. We also thank Hanbo Yan for help in recording stimuli, Amit Almor for help in recruiting participants, and Mila Tasseva-Kurktchieva for insightful comments.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported in part by the Linguistics Program Graduate Student Summer Research Award from the University of South Carolina, USA.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
