The Time Course of the Pupillary Response to Auditory Emotions in Pseudospeech,Music,and Vocalizations

Abstract

Emotions can be communicated through visual and dynamic characteristics such as smiles and gestures, but also through auditory channels such as laughter, music, and human speech. Pupil dilation has become a notable marker for visual emotion processing; however the pupil's sensitivity to emotional sounds, specifically speech, remains largely underexplored. This study investigated the processing of emotional pseudospeech, which are speech-like sentences devoid of semantic content. We measured participants’ pupil dilations while they listened to pseudospeech, music, and human vocalizations, and subsequently performed an emotion recognition task. Our results showed that emotional pseudospeech can trigger increases of pupil dilation compared to neutral pseudospeech, supporting the use of pupillometry as a tool for indexing prosodic emotion processing in the absence of semantics. However, pupil responses to pseudospeech were smaller and slower than the responses evoked by human vocalizations. The pupillary response was not sensitive enough to distinguish between emotion categories in pseudospeech, but pupil dilations to music and vocalizations reflected some emotion-specific pupillary curves. The valence of the stimulus had a stronger overall influence on pupil size than arousal. These results highlight the potential for pupillometry in studying auditory emotion processing and provide a foundation for contextualizing pseudospeech alongside other affective auditory stimuli.

Keywords

pupillometry pupil dilation pseudospeech valence growth curve analysis

Introduction

Speech is a hugely important means of communication. In order for speech communication to be effective, we need to understand not only the words that an individual is saying, but also the emotion behind the words. There are clinical populations where the recognition of these emotions may be diminished, such as people with hearing impairments or autism (Christensen et al., 2019; Dennis et al., 2000). For these groups, explicit recognition of emotions can be unreliable or even absent, leading to significant barriers in their social and emotional functioning. In this study, we investigate pupil dilation as a physiological marker of auditory emotion processing in pseudospeech, instrumental music, and non-verbal human vocalizations (such as laughter and cries).

The pupillary response has been widely used as an implicit measure of cognitive and emotional processing (Granholm & Steinhauer, 2004; Hess & Polt, 1964; Hoeks & Levelt, 1993). Researchers have extensively investigated this response in visual emotion processing (Bradley et al., 2008; Partala & Surakka, 2003; Snowden et al., 2016), but also for affective auditory sounds like music and vocalizations (Gingras et al., 2015; Oliva & Anikin, 2018). In contrast, emotional speech has received far less attention in pupillometry research, but the few studies on this topic have shown that emotional speech can also evoke an increase in pupil dilation (Jürgens, 2018; Kuchinke et al., 2011). Both of these studies showed that semantically neutral speech elicits a pupillary response in the listener that is greater when presented with emotional prosody compared to neutral prosody. However, quantifying speech emotion processing through pupil dilation is challenging, since speech is a complex signal that conveys more information than emotions alone. In addition, pupil dilation is commonly used to index listening effort and auditory sentence processing in speech (Aydın & Uzun, 2023; Engelhardt et al., 2010; Wagner et al., 2019). If pupil dilation can index emotion processing to prosodic emotional speech in the absence of any semantic content, we can isolate emotion processing from general processing effort.

Given that pupil dilation can index emotional processing in speech, it becomes important to isolate which speech features drive this effect. Speech can contain emotional information in the form of paralinguistics, prosody, and semantics. The combination of these channels is informative, and while listeners naturally integrate this information, there exists evidence for both a prosodic and semantic dominance (Ben-David et al., 2016; Kotz & Paulmann, 2007). To reduce the confound of having conflicting semantic and prosodic emotional information, neuroimaging and behavioral studies employ the use of pseudosentences, consisting of pseudowords (Brosch et al., 2009; Nagels et al., 2020). Another method to separate these two channels is through filtering the speech signal. For instance, Kotz et al. (2003) used a PURR-filter (Prosody Unveiling through Restricted Representation) to remove segmental and lexical content while preserving the acoustic parameters of the signal (Sonntag & Portele, 1998). In this study, we have used non-semantic prosodic pseudospeech, which preserves prosody while eliminating semantic content (Lausen & Hammerschmidt, 2020).

Pupil dilation can reflect autonomic, emotion, semantic, and cognitive processing (Francis & Love, 2020). Pupil dilation can reflect sympathetic activation caused by a highly arousing stimulus, but it can also reflect parasympathetic inhibition (Mcdougal & Gamlin, 2015). It captures both bottom-up processes, such as autonomic arousal, and top-down processes, such as mental effort (Granholm & Steinhauer, 2004). Although these components cannot yet be fully disentangled, we can investigate how pseudospeech relates to two other semantically-free emotional sounds, human vocalizations and instrumental music. By directly comparing emotional responses to prosodic pseudospeech, music, and nonverbal vocalizations, our study provides a cross-domain perspective on auditory emotion processing.

Non-verbal human vocalizations are extremely effective in conveying a speaker's internal state, with conscious processing doing little to improve the accuracy of emotion recognition (Lima et al., 2019). From event-related potential (ERP) research, we see that listeners encode emotional vocalizations as early as 100 ms after stimulus onset, a faster response than emotional speech (Pell et al., 2015; Sauter & Eimer, 2010). As vocalizations are widely associated with eliciting physiological arousal in the listener (Azhari et al., 2018; Groh & Roisman, 2009), we suspect that the pupillary response to human vocalizations is reflecting autonomic arousal. Jürgens et al. (2018) showed that skin conductance, a measure of autonomic arousal, was modulated by affective bursts, but was not modulated by emotional speech. Affective bursts are sounds such as alarms and bombs, and while different from human vocalizations, they clearly induce emotional arousal in the listener.

Emotional music, like speech, evokes a pupillary response in the listener (Gingras et al., 2015) and pupils dilate more for vocal than instrumental music, indicating the higher relevance of vocal information for the listener (Weiss et al., 2016). The complexity of the stimulus also plays a role: in a study investigating the pupillary response to poetic language, it was shown that aesthetic poetry, which is complex and subtle in its emotional meaning, resulted in a slower pupillary response than comical poetry, which is more easily perceptible (Niikuni et al., 2022). We therefore expect a faster or larger pupillary response for more emotionally salient information, such as vocalizations, than for non-vocal music and pseudospeech.

Emotions can be categorized into labels such as happy, sad, or fearful. Alternatively, the Circumplex model of affect describes emotions along continuous valence and arousal dimensions (Posner et al., 2005; Russell, 1980). Pupillometry research has employed both models of emotion, with mixed results as to how pupil size is modulated by valence, arousal, or specific emotion categories (Kret et al., 2013; Oliva & Anikin, 2018; Partala & Surakka, 2003). Some studies suggest pupil dilation is primarily driven by arousal, with larger responses to highly arousing sound stimuli (Partala et al., 2000), while others find that negative valence elicits greater or more sustained pupil dilation (Burley et al., 2017; Geangu et al., 2011). The valence intensity also predicts pupil size, with responses independent of whether the stimulus is positive or negative (Bradley et al., 2008; Oliva & Anikin, 2018), and this pattern extends to semantically neutral speech with emotional prosody (Kuchinke et al., 2011).

In this study, we investigate two research questions.

Is emotion processing reflected in the pupillary response to emotional pseudosentences?

How does the pupillary response to emotional pseudosentences compare to non-word human vocalizations and instrumental music?

We used Dutch accented pseudosentences (Bänziger et al., 2012; Nagels et al., 2020), human vocalizations from the Montreal Affective Voices (MAV) (Belin et al., 2008), and the Musical Affective Bursts (MEB) (Paquette et al., 2013), which were developed to be a musical version of the MAV stimuli. We collected both pupillometry and behavioral measures of emotion. Our participants were presented with a forced-choice task asking them to identify which emotion was heard in the stimulus, and, in addition, we collected valence and arousal ratings of the stimuli.

Our primary hypotheses were:

Emotional pseudosentences will elicit greater pupil dilation than neutral pseudosentences, reflecting the processing of emotion prosody.

Human vocalizations will trigger greater and earlier pupil dilation responses, consistent with autonomic arousal.

The pupil response will be highly dependent on the type of stimulus, valence, arousal, and emotion category.

Methods

Ethics Statement

The experimental procedures were given ethical approval from the Medical Ethical Committee of the University Medical Center Groningen (UMCG). Participants were informed of the study procedures both verbally and through information letters, and written informed consent was collected from each participant prior to the experiment.

Participants

Twenty-one native Dutch participants were recruited (7 male; mean age 23 years; age range 18–30 years), the majority were students (n = 16) and received 8 euro/hour for their participation. Inclusion criteria were: native Dutch speakers, no early second-language acquisition (<3 years), and no reported dyslexia or autism. The participants’ hearing thresholds were tested for normal hearing, at or below 20 dB HL (frequency range 125 Hz–4 kHz) and had normal or corrected-to-normal vision.

Stimuli

There were three different types of stimuli used in the experiment; pseudospeech, music, and vocalizations. Stimuli were RMS-normalized in overall amplitude across all materials. The sound presentation level was fixed at around 65 dB SPL. Given the varying spectro-temporal properties of signals across different stimulus sets, each stimulus set was separately calibrated with a manikin (KEMAR, GRAS) and a sound-pressure level meter (Brüel & Kjær, Svan 979).

Pseudospeech

The pseudospeech stimuli consisted of 48 recordings produced by six speakers (3 male), expressing four emotions (anger, happiness, sadness, neutral). The sentences were taken from validated pseudo-linguistic materials (Bänziger et al., 2012) and recordings were made in a sound-attenuating room at 44.1 kHz. The stimuli were informally validated in an online forced-choice task (Nagels et al., 2020). Stimulus duration ranged from 1159 to 2147 ms. The mean duration (± standard error) was 1419 ± 47 ms for anger, 1546 ± 40 ms for happiness, 1580 ± 72 ms for neutral, and 1585 ± 68 ms for sadness.

Music

The music stimuli consisted of 2 instruments (violin and clarinet), 6 pieces, and 4 emotions (fear, happiness, sadness, neutral), totaling 48 items. The music stimuli were selected from the MEB (Paquette et al., 2013), which were produced with the intention of being a musical version of the MAV (Belin et al., 2008). Paquette et al. (2013) validated the database with 60 participants (19 males) in an online validation. Stimulus duration ranged from 400 to 3110 ms. The mean duration (± standard error) was 928 ± 59 ms for fear, 1529 ± 95 ms for happiness, 1421 ± 70 ms for neutral, and 2608 ± 136 ms for sadness.

Vocalizations

The vocalization stimuli consisted of 10 speakers (5 male) producing the utterance ah (/ɑ/) in 6 emotions (anger, fear, happiness, sadness, pleasure, neutral), totaling 60 items. The vocalizations used in the experiment were a subset of the MAV and were formally validated (Belin et al., 2008). Stimulus duration ranged from 240 to 3600 ms. The mean duration (± standard error) was 851 ± 109 ms for anger, 580 ± 59 ms for fear, 1402 ± 175 ms for happiness, 932 ± 141 ms for neutral, 1363 ± 142 ms for pleasure, and 2135 ± 236 ms for sadness.

Procedure

Prior to the experiment, participants were given instructions on how to make their judgements of valence and arousal for the stimuli. The term valence was described to them as the degree of positive or negative emotion conveyed by the sound, and arousal as how much excitation is elicited by the emotion in the sound. The participants were given examples of the extremes of the scales from each stimulus set, and were informed that they would choose the emotion category (happy, sad etc.) that best represented the emotion in the sound. However, participants were not given descriptions of the meaning of the emotion categories.

Participants were seated in a sound-treated, dimly-lit room where the lighting was kept constant throughout the experiment. A head-mounted Eyelink II (SR Research) eye-tracker was used for all participants. The procedure started with a short practice block (20 trials, all stimulus types), to familiarize participants with the rating procedure and the range of stimulus types. The experimenter was present during this period and participants were allowed to ask questions about the experiment and the rating scales. Following the practice block, the three experimental blocks (pseudospeech, music, vocalizations) were presented in a random order.

For each trial, a fixation cross was presented 1 s prior to the stimulus onset, and it remained until 3 s after the stimulus offset. Participants were instructed to minimize blinking and focus on the cross while it was presented. After stimulus presentation, participants rated valence on a continuous scale (1 = very negative; 9 = very positive), and arousal (1 = low arousal; 9 = high arousal). Finally, the participant performed a forced-choice task (pseudospeech and music: 4 categories, vocalizations: 6 categories) where they chose the word that best represented the emotion conveyed in the sound. Participants used a mouse to make their selection, and before the beginning of the next trial, a screen instructing the participants to blink was presented.

Pupil Data Preprocessing

Matlab (version R2017a) was used to carry out the preprocessing of the pupil data. Firstly, each participant's raw pupil data was divided into trials, which started from 1 s before the onset to three seconds after the onset of the stimulus. Then, each trial was checked for blinks exceeding 300 ms. Trials with blinks longer than 300 ms were discarded (4.5% of all data) to avoid artefacts from interpolation.

Dilation outliers were detected using the median absolute deviation, which is a robust metric to detect blinks in pupillometry data (Kret & Sjak-Shie, 2019; Leys et al., 2013). Samples exceeding a velocity threshold of 16 (in line with pupil dilation limits; Mathôt et al., 2018) were discarded. The missing values were interpolated via a piecewise cubic hermite interpolating polynomial. The pre-trial baseline was obtained by taking the average of the pupil size in the 1 s prior to the stimulus onset. The event-related pupil dilation (ERPD) was calculated from the following equation (Wagner et al., 2019):

E R P D = \frac{p u p i l s i z e - b a s e l i n e}{b a s e l i n e} \times 100

This ERPD provides the percentage change in pupil dilation relative to the pre-trial baseline. Next, we smoothed our data by reducing the frequency from 250 to 50 Hz through averaging consecutive data points into bins of 20 ms (effectively over 5 data points).

Results

Behavioral Data

The behavioral data were collected to confirm the categorical and dimensional models of emotions in our three stimulus types: pseudospeech, music, and vocalizations. The music (MEB) and vocalization (MAV) stimulus sets were previously validated for both dimensional and categorical models of emotion, while the pseudospeech (EmoHI) stimuli were validated only for the categorical model of emotion (Belin et al., 2008; Nagels et al., 2020; Paquette et al., 2013). Since these three validations were done separately, we collected both dimensional and categorical ratings of emotion to allow direct, cross-modal comparisons. The statistical analyses and plotting were performed in Rstudio (www.R-project.org, R version 4.2.2). Linear mixed effects models (LMER) and generalized linear mixed effects models (GLMER) were fitted using the lmer and glmer function of the lme4 package respectively (version 1.1–35.1) (Bates et al., 2015). We tested if each term in our models improved the model fit via likelihood ratio tests until we ended up with the most parsimonious model, this included testing for random intercepts and random slopes, up until the most complex random effects structure. The lmerTest package was used to calculate p-values based on Satterthwaite's method (Kuznetsova et al., 2017). We tested the LMER model assumptions for (1) normality of residuals, (2) linearity (for models with continuous predictors), and (3) homogeneity of variances via diagnostic plots. Weights were added to some LMER models to stabilize variance.

Categorical Ratings

Figure 1 shows the percent correct recognition scores for each emotion category across stimulus types. In general, participants correctly identified emotion categories above chance level. A notable confusion was between happiness and pleasure in vocalizations (21% and 13% accuracy, respectively), likely due to their shared positive valence and the fact that no explicit instructions for emotion labels were given to the participants. In order to reduce the statistical error, we grouped together the happiness and pleasure categories, meaning that if a participant rated happiness as pleasure, or vice versa, we concluded this as being a correct answer. Confusion matrices can be found in the Supplemental material (Supplemental Figure 1 ‘Speech’; Supplemental Figure 2 ‘Music’; and Supplemental Figure 3 ‘Vocalizations’).

Figure 1.

Box Plots Displaying Percent Correct Emotion Recognition Scores for Pseudospeech, Music, and Vocalizations. Horizontal Lines in Bold Indicate Medians, Boxes Indicate Data Within the 25th to 75th Percentile, Whiskers Indicate Data Within the 5th to 95th Percentile, and Circles Indicate Outliers.

For the pseudospeech stimuli, participants’ scored above the chance level of 25% for all four categories, namely, neutral (86.9%, SE = 0.022), sadness (78.3%, SE = 0.027), anger (83.3%, SE = 0.025), and happiness (75.6%, SE = 0.030). For the music stimuli, participants’ scored well above the chance level of 25% for all four categories, namely, neutral (76.8%, SE = 0.028), sadness (81.0%, SE = 0.024), fear (78.0%, SE = 0.028), and happiness (91.0%, SE = 0.017). The vocalization ratings show high accuracy scores above chance level (16.7%) for the six categories; neutral (95.0%, SE = 0.016), sadness (96.6%, SE = 0.014), anger (80.3%, SE = 0.032), fear (82.6%, SE = 0.029), happiness (99.4%, SE = 0.006), and pleasure (81.2%, SE = 0.034).

We removed extreme outliers based on the z-score method over four iterations, where a z-score of less than −3 or greater than 3 for a given data point was removed, resulting in two audio files being removed from the dataset (overall recognition rates of 0% and 10% across all participants) along with several participant-emotion combinations. A GLMER revealed a significant interaction between emotion category and stimulus type on recognition accuracy, χ²(6) = 27.253, p < .001. Notably, neutral emotions were more difficult to recognize in music (Est = −1.093, SE = 0.528, p = .0386), while sadness was most easily recognized in vocalizations (Est = –2.141, SE = 0.732, p = .00347). Full model estimates and additional comparisons are provided in Supplemental Tables 1(a) to (e), and the random effect variance estimate is provided in Supplemental Table 1(f).

Valence and Arousal Ratings

Figure 2 shows the participants’ ratings of valence and arousal for all three stimuli types of pseudospeech, music, and vocalizations. The valence and arousal data were first analyzed on an ordinal scale, by binning the responses and running the model as an ordinal logistic mixed effect model. But the model assumptions were violated and the data were more suited as a linear model, so the analyses were done with a linear mixed-effects model. A significant interaction between stimulus type and emotion category influenced valence ratings, χ²(6) = 65.203, p < .001. Music slightly decreased valence compared to pseudospeech (Est = –0.397, SE = 0.181, p = .0295), while both music (Est = 0.949, SE = 0.263, p < .001) and vocalizations increased valence when associated with happiness (Est = 1.151, SE = 0.263, p < .001). The full model estimates are provided in Supplemental Table 2(a) and the random effect variance estimates in Supplemental Table 2(b).

Figure 2.

Participant Ratings of (a) Valence (Top Panel) and (b) Arousal (Bottom Panel) for Each Stimulus Type; Pseudospeech, Music, and Vocalizations. Horizontal Lines in Bold Indicate Medians, Boxes Indicate Data Within the 25th to 75th Percentile, Whiskers Indicate Data Within the 5th to 95th Percentile, and Circles Indicate Outliers. Values Range From 1 to 9 and the Dotted Line at 5 Represents the Midway Point.

Similarly, for arousal, an interaction effect of stimulus type and emotion was found to significantly improve model fit compared to a model without the interaction, χ²(6) = 71.54, p < .001. For stimulus type, vocalizations (Est = –0.448, SE = 0.281, p = .0846) and music (Est = −0.116, SE = 0.263, p = .6593) did not differ significantly from pseudospeech. There was a strong positive interaction effect of sadness with vocalizations (Est = 2.973, SE = 0.381, p < .001) and music (Est = 1.543, SE = 0.340, p < .001), revealing that arousal increased significantly. The interaction between anger and vocalizations (Est = 0.835, SE = 0.387, p = .0351) was also significant. The full model estimates are provided in Supplemental Table 3(a) and the random effect variance estimates in Supplemental Table 3(b).

Pupillometry Data

We analyzed ERPD in response to emotional pseudospeech, music, and vocalizations using growth curve analysis (GCA; Mirman, 2014). GCA allows us to capture the time course of pupil dilation, via linear, quadratic, and cubic polynomial time terms. The linear time term represents the overall slope of the pupil data (larger values indicating greater slope), the quadratic term reflects the symmetricity in the peak pupil dilation (negative values indicating a sharper peak) and the cubic term is both a qualifier of the linear term and the extent to which a second inflection point is represented in the tails, where more positive values indicate a ‘sharper’ pupil response (McGarrigle et al., 2017; Neagu et al., 2023; Wagner et al., 2019). The statistical analyses and plotting were performed in Rstudio (www.R-project.org, R version 4.4.2). Pairwise post hoc tests were conducted using the glht function of the multcomp package (Hothorn et al., 2008). We used linear mixed effects regression models from the lme4 package (Bates et al., 2015) and ran different models for the emotion categories and the dimensional model of emotion. We included valence and arousal as centered, continuous variables. Each stimuli set was analyzed using the same time window, which was the shortest trial length (0 to 3240 ms after stimulus onset). Stepwise model comparison in conjunction with visual inspection was used to determine the best terms to use in each model. We started with the null model and subsequently added effects until we found the most parsimonious model. We included maximal random effects structures with random slopes and intercepts but subsequently removed higher order terms due to convergence errors. We looked at the fit of the model to the data so as to avoid overfitting. Statistical significance for the individual parameter estimates was assessed by using the normal approximation (i.e. treating the t-value as a z-value).

Figure 3 shows the ERPDs with 95% confidence intervals and the fitted GCA model values for pseudospeech, music, and vocalizations (Supplemental material, Tables 4, 5 and 7). In Figure 3(Speech) we see the pseudospeech stimulus type for the four emotion categories; neutral, anger, happiness, and sadness. The overall shape of the ERPDs in Figure 3(Speech) is similar for all four categories, showing an increase from baseline, peaking around 1500 ms and then returning to baseline. We see that the three categories of anger, happiness, and sadness reach a peak that is almost double in size of the neutral category. In Figure 3(Music) we see the music stimulus type for the four emotion categories: neutral, fear, happiness, and sadness. The overall shape of the ERPDs in Figure 3(Music) are generally different from each other. The fear category has a large and early peak, and sadness has a smaller and later peak, with happiness somewhere in between. The neutral category has the smallest peak pupil dilation. In Figure 3(Vocalizations) we see the vocalization stimuli type for the six emotion categories: neutral, fear, anger, happiness, pleasure, and sadness. The neutral category peaks higher than the pseudospeech and music ERPDs. The sadness category has a large and late peak. Happiness and pleasure have the second largest peak dilation and both follow a very similar pattern.

Figure 3.

The Change in Pupil Dilation, as Reflected by the Event Related Pupil Dilation (ERPD) and the Fitted Model Values for Emotion Category in Each Stimulus Type of Pseudospeech, Music, and Vocalizations. The Shaded Interval Indicates the 95% Confidence Interval of the Pupil Dilation Curves.

Research Question 1: Is emotion processing reflected in the pupillary response to emotional pseudosentences?

Pseudospeech

Table 1 shows the summary model estimates and significance codes for the pseudospeech emotion category model, the full model estimates and standard errors can be found in the Supplemental material, Table 4(a). The full model included a fixed effect of emotion, χ²(3) = 13.8, p = .003, and emotion as an interaction on the quadratic, χ²(3) = 9.7, p = .021, term of the polynomial. Neutral was included in the intercept, and random effects of subject on the linear, quadratic, and cubic terms, as well as subject-by-emotion on the linear, quadratic, and cubic terms were included. In Figure 3(Speech), we see that the emotion categories have a greater average dilation than the neutral category, and this was reflected in the intercept of our GCA model (Anger: Est = 0.783, SE = 0.243, p = .001; Happiness: Est = 0.662, SE = 0.243, p = .006; Sadness: Est = 0.849, SE = 0.243, p < .001). The sharpest peak dilation, as reflected by the quadratic time term of the polynomial, was observed for the anger category (Est = −3.797, SE = 1.277, p = .003). The addition of the linear time term as an interaction on emotion did not improve model fit, χ²(3) = 2.6, p = .463, nor did the addition of the cubic term, χ²(3) = 5.6, p = .133, indicating that differences between emotional categories were captured in the magnitude and symmetry of dilation, rather than the time course of shape of the curve.

Table 1.

Summary Model Estimates for the Pseudospeech Emotion Category Model, Showing the Interaction of Each Emotion Category With Each Time Term.

	Intercept	Quadratic
Intercept	0.762	−6.112 (*)
Anger	0.783 (*)	−3.797 (*)
Happiness	0.663 (*)	−3.111 (*)
Sadness	0.849 (*)	−3.309 (*)

Model formula: erpd ∼ Linear + Cubic + Quadratic * Emotion + (Linear + Quadratic + Cubic | Subject:Emotion) + (Linear + Quadratic + Cubic | Subject). *p < .05. Total number of observations: 147,258, number of subjects: 21, number of subject:emotion: 84.

Research Question 2. How does the pupillary response to emotional pseudosentences compare to non-word human vocalizations and instrumental music?

Music

Table 2 shows the summary model estimates and significance codes for the music emotion category model, the full model estimates and standard errors can be found in the Supplemental material, Table 5(a). Our final model included fixed effects of emotion, χ²(3) = 13.1, p = .004, and an interaction of emotion on the linear, χ²(3) = 20.2, p < .001; quadratic χ²(3) = 19.3, p < .001; and cubic time term of the polynomial, χ²(3) = 34.0, p < .001. Neutral was included in the intercept, and random effects of subject on the linear, quadratic, and cubic time term were included, as well as subject-by-emotion on the linear, quadratic, and cubic time term. There was an effect of emotion category on the intercept of our model, indicating that each emotion had a significant greater average ERPD than neutral (Fear: Est = 1.026, SE = 0.294, p < .001; Happiness: Est = 0.882, SE = 0.294, p = .003; Sadness: Est = 0.745, SE = 0.294, p = .011). The largest overall dilation was observed for fear, which peaked early and had a steep decline, as reflected in the negative linear time term (Est = –3.595, SE = 1.154, p = .054). In Figure 3 (Music) we can see how sadness peaks much later than neutral, fear and happiness, this is reflected in the linear time term, which showed that the sadness category (Est = 5.394, SE = 1.866, p = .004) had the greatest overall slope of the pupillary curve. The curve with the most symmetricity around the peak, as reflected by the quadratic time term, was observed for the happiness category (Est = −5.449, SE = 0.160, p < .001). The curve with the sharpest rise and fall, reflected by the cubic time term, is seen for the fear category (Est = 3.818, SE = 0.806, p < .001).

Table 2.

Summary Model Estimates and Significance Codes for the Music Emotion Category Model, Showing the Interaction of Each Emotion Category With Each Time Term.

	Intercept	Linear	Quadratic	Cubic
Intercept	0.473	−3.744	−3.592 (*)	0.904
Fear	1.026 (*)	−3.595	−3.242 (*)	3.818 (*)
Happiness	0.882 (*)	1.429	−5.449 (*)	1.195
Sadness	0.745 (*)	5.394 (*)	−3.191 (*)	−1.390

Model formula: erpd ∼ (Linear + Quadratic + Cubic) * Emotion + (Linear + Quadratic + Cubic| Subject) + (Linear + Quadratic + Cubic | Subject:Emotion). *p < .05. Total number of observations: 173,988, number of subjects: 21, number of subject:emotion: 84.

A post hoc analysis revealed differences between some emotion categories on different time terms of the polynomial. The full results of the post hoc tests can be found in the Supplemental material, Table 6. There were differences in the cubic time term for fear with each other emotion (fear vs. neutral: z-value = –4.735, p < .001; fear vs. happiness: z value = –3.257, p = .023; fear vs. sadness: z value = –6.451, p < .001) and differences with happiness vs. sadness: z value = –3.207, p = .027. The quadratic time term revealed only one significant pairwise comparison, neutral vs. happiness: z value = −4.722, p < .001. The linear time term confirmed the delayed ERPD peak in sadness vs. fear: z value = 4.817, p < .001. The average ERPD response revealed differences in fear vs. neutral: z value = –3.495, p = .0104 and near differences in neutral vs. happiness: z value = 3.006, p = .050.

Vocalizations

Table 3 shows the summary model estimates and significance codes for the vocalizations emotion category model, the full model estimates and standard errors can be found in the Supplemental material, Table 7(a). Our full model included fixed effects of emotion, χ²(5) = 35.8, p < .001, and an interaction of emotion on the linear time term, χ²(5) = 21.4, p = .001 and the quadratic time term, χ²(5) = 29.4, p < .001. Neutral was included in the intercept, and random effects of subject on all time terms of the polynomial, as well as subject-by-emotion on all time terms of the polynomial were included. The pupillary curve with the greatest intercept, and therefore the greatest area under the curve, was observed for the sadness stimuli (Est = 2.05, SE = 0.367, p < .001). In Figure 3(Vocalizations), we see how each of the emotion categories peaks and declines similarly, except for the sadness category, which has a wider peak and slow decline. This observation is reflected in the interaction with emotion on the linear time term, which shows the high model estimate value for the sadness stimuli (Est = 8.25, SE = 1.927, p < .001). The emotion curve with the greatest symmetry around the peak, as reflected by the quadratic time term, was observed for the happiness category (Est = –5.8231, SE = 1.402, p < .001).

Table 3.

Summary Model Estimates and Significance Codes for the Full Vocalizations Emotion Category Model, Showing the Interaction of Each Emotion Category With Each Time Term.

	Intercept	Linear	Quadratic
Intercept	1.109 (*)	3.608 (*)	−6.787 (*)
Anger	0.400	0.885	−1.590
Fear	0.056	0.598	−0.172
Happiness	1.206 (*)	1.199	−5.823 (*)
Pleasure	1.018 (*)	1.892	−5.369 (*)
Sadness	2.051 (*)	8.24 (*)	−4.715 (*)

Full model: erpd ∼ (Linear + Quadratic) * Emotion + Cubic + (Linear + Quadratic + Cubic | Subject:Emotion) + (Linear + Quadratic + Cubic | Subject). *p < .05. Total number of observations: 168,804, number of subjects: 21, number of subject:emotion: 158.

Fear and anger, by contrast, did not differ significantly from neutral on any time term. The interaction of emotion on the cubic time term of the polynomial did not improve model fit χ²(5) = 10.396, p = .065. This indicates that the rise and fall of the pupillary curves did not differ between emotions, this is reflected in Figure 3(Vocalizations), as we see that each of the emotion curves have similar timing in their peak dilations.

A post hoc analysis revealed differences between some emotion categories on different time terms of the polynomial. The full results of the post hoc tests can be found in the Supplemental material, Table 8. For the quadratic time term, fear had differences with three emotion categories (fear vs. pleasure: z value = 3.709, p < .01; fear vs. happiness: z value = –4.209, p < .01; fear vs. sadness: z value = –3.242, p = .043) and neutral revealed the same differences as with fear (neutral vs. pleasure: z value = 3.833, p < .01; neutral vs. happiness: z value = –4.153, p < .01; neutral vs. sadness: z value = –3.366, p = .029). In the linear time term, sadness revealed significant differences with every other emotion category (sadness vs. neutral: z value = 4.281, p < .001; sadness vs. fear: z value = 3.969, p < .01; sadness vs. anger: z value = 3.821, p < .01; sadness vs. pleasure: z value = 3.299, p < .036; sadness vs. happiness: z value = 3.654, p = .010). The average ERPD response revealed differences for three sadness comparisons (sadness vs. neutral: z value = 5.591, p < .001; sadness vs. fear: z value = 5.438, p = <.001; sadness vs. anger: z value = 4.501, p < .001) and neutral vs. happiness: z value = 3.287, p = .038.

Figures 4 and 5 show the valence and arousal fitted model values and the ERPDs with 95% confidence intervals pseudospeech, music, and vocalizations. In Figure 5 we see that high arousal results in overall larger pupillary curves in pseudospeech, music, and vocalizations. Figure 5(Music) shows the strongest distinction between high, medium, and low arousal. In Figure 4, the high and low valence ERPDs are all greater than those with medium valence, with low valence having the greatest ERPD in Figure 4(Speech) and Figure 4(Music), but high valence has the largest pupil response in vocalizations.

Figure 4.

The Change in Pupil Dilation, as Reflected by the Event Related Pupil Dilation (ERPD) and the Fitted Model Values for Valence in Each Stimulus Type of Pseudospeech, Music, and Vocalizations. The Shaded Interval Indicates the 95% Confidence Interval of the Pupil Dilation Curves.

Figure 5.

The Change in Pupil Dilation, as Reflected by the Event Related Pupil Dilation (ERPD) and the Fitted Model Values for Arousal in Each Stimulus Type of Pseudospeech, Music, and Vocalizations. The Shaded Interval Indicates the 95% Confidence Interval of the Pupil Dilation Curves.

Pseudospeech, Music, and Vocalizations

Table 4 shows the full model estimates and standard errors for the valence and arousal stimuli comparison model. The random effect variance estimates can be found in the Supplemental Table 9. We tested the effect of stimuli type, valence, and arousal on ERPD. Our full model included a three-way interaction of the quadratic term and stimulus type on valence, χ²(2) = 73.002, p < .001, and arousal, χ²(2) = 38.666, p < .001. The model included interactions of valence with the linear, χ²(2) = 43.881, p < .001, and cubic term of the polynomial, χ²(2) = 71.602, p < .001. Pseudospeech was included in the intercept, and random effects of subject on the linear, quadratic, and cubic time terms, as well as item on the linear, quadratic, and cubic time terms were included.

Table 4.

Full Model Estimates and Standard Errors for the Valence and Arousal Stimuli Model.

	Estimate	SE	t value	p	Sig
(Intercept)	1.402	0.417	3.364	<.001	*
Linear time term	−3.228	2.057	−1.570	0.167
Quadratic time term	−8.763	1.358	−6.454	<.001	*
Cubic time term	2.249	0.643	2.496	<.001	*
Valence	−0.137	0.010	−14.075	<.001	*
Arousal	0.091	0.010	9.292	<.001	*
Stimulus Music	−0.228	0.170	−1.347	0.178
Stimulus Vocalizations	0.397	0.171	2.322	0.020	*
Linear term:Valence	−1.459	0.064	−22.694	<.001	*
Quadratic term:Valence	0.645	0.117	5.527	<.001	*
Quadratic term:Arousal	−0.199	0.115	−1.735	0.083
Cubic term:Valence	0.318	0.056	5.627	<.001	*
Quadratic term:Music	2.241	0.827	2.709	0.007	*
Quadratic term:Vocalizations	−0.281	0.834	−0.337	0.736
Valence:Music	−0.090	0.013	−7.111	<.001	*
Valence:Vocalizations	0.170	0.014	12.292	<.001	*
Arousal:Music	−0.019	0.013	−1.479	0.139
Arousal:Vocalizations	0.045	0.013	3.523	<.001	*
Quadratic term:Valence:Music	−0.245	0.152	−1.614	0.107
Quadratic term:Valence:Vocalizations	−1.358	0.160	−8.464	<.001	*
Quadratic term:Arousal:Music	−0.543	0.149	−3.640	<.001	*
Quadratic term:Arousal:Vocalizations	−0.672	0.154	−4.371	<.001	*

Full model formula: erpd ∼ Quadratic * Stimulus Type * (Valence + Arousal) + (Linear + Cubic) * Valence + (Linear + Quadratic + Cubic | Subject) + (Linear + Quadratic + Cubic | Item). * p < .05. Total number of observations: 490,050, number of subject: 21, number of items: 158.

There was a significant interaction of valence with the linear, Est = –1.459, SE = 0.064, p < .001, quadratic, Est = 0.645, SE = 0.117, p < .001, and cubic time term, Est = 0.318, SE = 0.056, p <.001. The large negative effect of valence on the linear term reflects a sharp decrease in the pupil at earlier time points. Valence had significant interactions with stimulus (Music: Est = –0.090, SE = 0.013, p < .001; Vocalizations: Est = 0.16, SE = 0.014, p < .001) and three-way interactions with vocalizations on the quadratic term, Est = –1.358, SE = 0.160, p < .001. Valence had a strong effect on the quadratic term in vocalizations, which reflects the curvature of the pupil around the peak. There was a significant interaction of arousal with vocalizations, Est = 0.045, SE = 0.013, p < .001, and in a three-way interaction with the quadratic time term, Est = –0.672, SE = 0.154, p < .001. There was a significant three-way interaction of arousal with music on the quadratic term, Est = –0.543, SE = 0.149, p < .001, and a significant positive interaction of vocalizations with valence, Est = 0.170, SE = 0.014, p < .001, reflecting that the average ERPD increases for vocalizations by valence.

Discussion

The current study examined the use of pupil dilation as a measure of auditory emotion processing. We identified that emotional pseudosentences elicit an overall higher average pupil dilation than neutral, in combination with a faster rise and fall in the time course of the pupil, thus confirming that pupil dilation can index emotion processing in pseudospeech. A similar effect was seen in music, with more nuances in the time course of pupil dilation. Although fear and anger in vocalizations failed to evoke a significant difference from neutral, the other emotions all showed increases in dilation compared with neutral. Vocalizations resulted in overall higher average dilations than pseudospeech, and average dilations in music did not differ from pseudospeech. We also report that valence has an overall stronger impact on pupil dilation than arousal. Our results are in line with previous research, showing that the pupillary response results in increased dilations for emotional speech, music, and human vocalizations (Gingras et al., 2015; Jürgens et al., 2018; Oliva & Anikin, 2018). The pupillary response to pseudospeech was not sensitive enough to distinguish between emotion categories, however, the time course of the pupillary response to music reflected differences between emotion categories.

Pseudospeech

Our results are the first to indicate that purely prosodic emotional pseudospeech triggers increased pupil dilations. This index in pupil dilation is likely a reflection of emotion processing, rather than cognitive, as participants showed no differences in recognition rates between emotion categories and neutral. Low valence has a strong impact on the pupillary response, with sadness producing the greatest average pupil dilation, and anger the strongest effect on the symmetricity of the peak pupil dilation. We did not see differences in the time course of the pupil between emotions in our pseudospeech stimuli, which is in line with research showing that the pupil might not be sensitive enough to delineate between emotions expressed in speech (Partala & Surakka, 2003), which has also been observed in semantically neutral speech (Kuchinke et al., 2011). Although Jürgens et al. (2018) showed that pupil responses to angry speech stimuli dilated faster, and the pupil response to joyful speech was delayed by 1 s. In addition, in poetic language, the pupil can differentiate between semantic comical and aesthetical poetry that were presented in a neutral tone (Niikuni et al., 2022). This shows that the pupil might be sensitive enough to distinguish between emotion categories for semantic speech stimuli, further research in pupil dilation studies should look at combining both semantic and prosodic emotion information.

Music

Pupil responses to music varied over time depending on the emotion. Fear evoked an early and strong response, likely a reflection of autonomic arousal (Mathôt, 2018), and in line with research that highly arousing music induces increased pupil dilation compared to low arousal music (Gingras et al., 2015). Sadness evoked a slower, gradual increase, likely indicative of cognitive and emotional processing. These differences can be explained by stimulus characteristics: the MEB fear stimuli were portrayed by a high-pitched note, whereas sadness was expressed as a short minor melody. However, our stimuli were not controlled for duration, and the fear stimuli were on average 1 s while the sadness stimuli were more than 2 s. Pupil dilation is sensitive to the duration of the stimulus and this can explain the differences we observed between emotions (Oliva & Anikin, 2018). Duration is inherent to certain emotions, sadness tends to unfold more slowly than fear or anger (Yildirim et al., 2004), it is an acoustic pattern that is a part of the emotion and future studies should account for these acoustic-emotional links.

Vocalizations

The greatest average pupillary response was observed for sadness, which showed an extended peak dilation without a return to baseline in the analysis time window. This sustained dilation to sadness could be related to the social relevance of cries, thus eliciting a stronger and more sustained emotional response than to that of laughter (Cosme et al., 2021). In contrast, high-valence emotions, like happiness and pleasure, evoked sharp and symmetric peak pupil dilations. Previous research has shown mixed results as to whether positively and negatively valence emotions differentially affect pupil size (Bradley et al., 2008; Jürgens et al., 2018; Kuchinke et al., 2011). Our results found that anger and fear did not differ significantly from neutral on any of the time components of the pupillary curve. This could be related to duration; anger, fear, and neutral had the shortest durations of all the categories. It could also reflect the confidence of participants’ ratings to human vocalizations, and anger and fear were less easily recognized compared to neutral, happiness, and sadness. But this lack of distinction from neutral could also be explained by emotion processing, as both anger and fear were rated as having less negative valence than sadness, and pupils respond to the intensity of the valence of vocalizations, with less intense stimuli producing smaller dilations (Oliva & Anikin, 2018). Their results also showed percentage change in pupil response ranging from 3% to 13%, where a low percentage change occurred when the relative baseline was high. Our ERPD results showed percentage pupillary changes around 3%–4%, which could be related to a high relative baseline pupil dilation in our participants.

Limitations

The pupil is a known indicator of autonomic activation, attention, and cognitive processing (Granholm & Steinhauer, 2004; Hess & Polt, 1964). In our experiment, participants listened to emotional auditory stimuli as we recorded their pupil dilations, and then performed an emotion recognition task. The pupillary responses observed likely reflect a combination of both autonomic activation and cognitive processing, as participants were required to rate the stimuli. This combination is inherent in our stimuli sets, where vocalizations are an indicator of autonomic arousal, and pseudospeech and music require more processing of information to recognize an emotion. Our study included six emotion categories for human vocalizations, four for music, and four for pseudospeech. While emotions in vocalizations are generally easier to recognize than speech (Hawk et al., 2009) and vocalizations had similar or higher rates of recognition than music and pseudospeech in our results, the larger stimulus set in vocalizations could have contributed to differences in pupillary responses arising from task difficulty. Future studies would benefit from including a number of additional measures such as subjective difficulty ratings and both a passive and active-listening task to further disentangle the mechanisms underlying pupil responses. While we intentionally broadened the context by comparing pseudospeech, music, and vocalizations, we acknowledge that fully disentangling these mechanisms may not be entirely possible.

GCA was used to analyze the ERPD results (Mirman, 2014). While this is a powerful method and can simplify interpretation by analyzing the pupil on different time curves, the method does not account for changes in gaze position or autocorrelation of pupil dilation. Other methods such as generalized additive mixed modelling (van Rij et al., 2019) and the temporal response function (Crosse et al., 2016) provide built-in methods to handle autocorrelation of the dependent variable. Future studies would benefit from using methods to handle autocorrelation of the pupil signal for more robust results. Our participants rated valence and arousal on a scale of 1–9, and could choose any number in between the whole numbers, up to three decimal places. We therefore needed to treat this data as continuous, but this would have been more suited as an ordinal scale. These ratings represent a subjective distance that varies between participants, in future studies this data should include a fixed whole number Likert scale that is then treated as ordinal.

Applications

Pupil dilation is known to reflect both cognitive and emotional processing, and does not reflect explicit emotion recognition, but its sensitivity to emotional salience offers insight into underlying emotion perception. This insight may benefit different populations with known difficulties in emotion recognition such as autism, hearing loss, and second language speakers (Christensen et al., 2019; Goy et al., 2018; Iacozza et al., 2017; Kuchinke et al., 2011; Parbery-Clark et al., 2013). This research could aid in improving hearing aids and cochlear implants for emotion recognition, by using pupillometry as a tool to measure emotion perception in speech, as well as short musical and human vocalization excerpts. Individuals with autism focus on semantics when trying to understand emotions in speech (Dennis et al., 2000), whereas individuals with hearing loss have increased effort to understand speech. The applicability of pseudosentences in testing these populations provides a clear stimulus to test emotion processing of speech prosody. The intensity of the valence/arousal of the stimulus, as we’ve shown, is not the only factor in modulating pupil dilation, the type of sound stimulus is also a factor. Affective stimuli corpora often include a range of affective stimuli types, from crying babies, to animal noises, to environmental sounds (Parsons et al., 2014; Yang et al., 2018), and future studies should take into account the effect of stimulus type when measuring auditory emotion processing.

Supplemental Material

sj-docx-1-tia-10.1177_23312165251365824 - Supplemental material for The Time Course of the Pupillary Response to Auditory Emotions in Pseudospeech, Music, and Vocalizations

Supplemental material, sj-docx-1-tia-10.1177_23312165251365824 for The Time Course of the Pupillary Response to Auditory Emotions in Pseudospeech, Music, and Vocalizations by Julie Kirwan, Deniz Başkent and Anita Wagner in Trends in Hearing

Footnotes

Acknowledgments

We would like to thank the University of Groningen high performance computing cluster Hábrók for providing the cluster used to compute the statistical models. We are grateful to the authors and contributors of the stimuli used in this experiment, the Montreal Affective Voices (MAV), the Musical Emotional Bursts (MEB), and the EmoHI stimuli. We would like to thank Paolo Toffanin for his technical support. We would like to acknowledge the assistance of OpenAI's ChatGPT in helping to refine language and improve the clarity. We also thank the reviewers for their comments as these helped to greatly improve this manuscript.

ORCID iDs

Julie Kirwan

Deniz Başkent

Anita Wagner

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was supported by funding from the EU's H2020 research and innovation program under the MSCA GA 675324, the ENRICH network ().

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability

The data for this project will be made available online in the DataverseNL repository (Kirwan et al., 2025).

Supplemental Material

Supplemental material for this article is available online.

References

Aydın

Ö.

Uzun

İ. P.

(2023). Pupil dilation response to prosody and syntax during auditory sentence processing. Journal of Psycholinguistic Research, 52(1), 153–177. https://doi.org/10.1007/s10936-021-09830-y

Azhari

Truzzi

Rigo

Bornstein

M. H.

Esposito

(2018). Putting salient vocalizations in context: Adults’ physiological arousal to emotive cues in domestic and external environments. Physiology and Behavior, 196(June), 25–32. https://doi.org/10.1016/j.physbeh.2018.08.010

Bänziger

Mortillaro

Scherer

K. R.

(2012). Introducing the Geneva multimodal expression corpus for experimental research on emotion perception. Emotion, 12(5), 1161–1179. https://doi.org/10.1037/a0025827

Bates

Mächler

Bolker

B. M.

Walker

S. C.

(2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01

Belin

Fillion-Bilodeau

Gosselin

(2008). The Montreal affective voices: A validated set of nonverbal affect bursts for research on auditory affective processing. Behavior Research Methods, 40(2), 531–539. https://doi.org/10.3758/BRM.40.2.531

Ben-David

B. M.

Multani

Shakuf

Rudzicz

van Lieshout

P. H. H. M.

(2016). Prosody and semantics are separate but not separable channels in the perception of emotional speech: Test for rating of emotions in speech. Journal of Speech, Language, and Hearing Research, 59(1), 72–89. https://doi.org/10.1044/2015_JSLHR-H-14-0323

Bradley

M. M.

Miccoli

Escrig

M. A.

Lang

P. J.

(2008). The pupil as a measure of emotional arousal and autonomic activation. Psychophysiology, 45(4), 602–607. https://doi.org/10.1111/j.1469-8986.2008.00654.x

Brosch

Grandjean

Sander

Scherer

K. R.

(2009). Cross-modal emotional attention: Emotional voices modulate early stages of visual processing. Journal of Cognitive Neuroscience, 21(9), 1670–1679. https://doi.org/10.1162/jocn.2009.21110

Burley

D. T.

Gray

N. S.

Snowden

R. J.

(2017). As far as the eye can see: Relationship between psychopathic traits and pupil response to affective stimuli. PLoS ONE, 12(1), 1–22. https://doi.org/10.1371/journal.pone.0167436

10.

Christensen

J. A.

Sis

Kulkarni

A. M.

Chatterjee

(2019). Effects of age and hearing loss on the recognition of emotions in speech. Ear and Hearing, 40(5), 1069–1083. https://doi.org/10.1097/AUD.0000000000000694

11.

Cosme

Rosa

P. J.

Lima

C. F.

Tavares

Scott

Chen

Prata

(2021). Pupil dilation reflects the authenticity of received nonverbal vocalizations. Scientific Reports, 11(1), 1–14. https://doi.org/10.1038/s41598-021-83070-x

12.

Crosse

M. J.

Di Liberto

G. M.

Bednar

Lalor

E. C.

(2016). The multivariate temporal response function (mTRF) toolbox: A MATLAB toolbox for relating neural signals to continuous stimuli. Frontiers in Human Neuroscience, 10, 604. https://doi.org/10.3389/fnhum.2016.00604

13.

Dennis

Lockyer

Lazenby

A. L.

(2000). How high-functioning children with autism understand real and deceptive emotion. Autism, 4(4), 370–381. https://doi.org/10.1177/1362361300004004003

14.

Engelhardt

P. E.

Ferreira

Patsenko

E. G.

(2010). Pupillometry reveals processing load during spoken language comprehension. Quarterly Journal of Experimental Psychology (2006), 63(4), 639–645. https://doi.org/10.1080/17470210903469864

15.

Francis

A. L.

Love

(2020). Listening effort: Are we measuring cognition or affect, or both? Wiley Interdisciplinary Reviews: Cognitive Science, 11(1), 1–27. https://doi.org/10.1002/wcs.1514

16.

Geangu

Hauf

Bhardwaj

Bentz

(2011). Infant pupil diameter changes in response to others’ positive and negative emotions. PLoS ONE, 6(11), e27132. https://doi.org/10.1371/journal.pone.0027132

17.

Gingras

Marin

M. M.

Puig-Waldmüller

Fitch

W. T.

(2015). The Eye is Listening: Music-Induced Arousal and Individual Differences Predict Pupillary Responses. Frontiers in Human Neuroscience, 9, 619. https://doi.org/10.3389/fnhum.2015.00619

18.

Goy

Pichora-Fuller

M. K.

Singh

Russo

F. A.

(2018). Hearing aids benefit recognition of words in emotional speech but not emotion identification. Trends in Hearing, 22, 1–16. https://doi.org/10.1177/2331216518801736

19.

Granholm

Steinhauer

S. R.

(2004). Pupillometric measures of cognitive and emotional processes. International Journal of Psychophysiology, 52(1), 1–6. https://doi.org/10.1016/j.ijpsycho.2003.12.001

20.

Groh

A. M.

Roisman

G. I.

(2009). Adults’ autonomic and subjective emotional responses to infant vocalizations: The role of secure base script knowledge. Developmental Psychology, 45(3), 889–893. https://doi.org/10.1037/a0014943

21.

Hawk

S. T.

van Kleef

G. A.

Fischer

A. H.

van der Schalk

(2009). Worth a thousand words”: Absolute and relative decoding of nonlinguistic affect vocalizations. Emotion, 9(3), 293–305. https://doi.org/10.1037/a0015178

22.

Hess

E. H.

Polt

J. M.

(1964). Pupil size in relation to mental activity during simple problem solving. Science, 140, 1190–1192. https://doi.org/10.1126/science.143.3611.1190

23.

Hoeks

Levelt

W. J. M.

(1993). Pupillary dilation as a measure of attention: A quantitative system analysis. Behavior Research Methods, Instruments, & Computers, 25(1), 16–26. https://doi.org/10.3758/BF03204445

24.

Hothorn

Bretz

Westfall

(2008). Simultaneous inference in general parametric models. Biometrical Journal, 50(3), 346–363. https://doi.org/10.1002/bimj.200810425

25.

Iacozza

Costa

Andoni

(2017). What do your eyes reveal about your foreign language ? Reading emotional sentences in a native and foreign language. PLoS One, 12(10), 1–10. https://doi.org/10.1371/journal.pone.0186027

26.

Jürgens

Fischer

Schacht

(2018). Hot speech and exploding bombs: Autonomic arousal during emotion classification of prosodic utterances and affective sounds. Frontiers in Psychology, 9(February), 1–13. https://doi.org/10.17605/OSF.IO/J5Z4A

27.

Kirwan

Başkent

Wagner

A. E

. (2025). ENRICH-EMO: The Time Course of the Pupillary Response to Auditory Emotions in Pseudospeech, Music and Vocalisations, DataverseNL. https://doi.org/10.34894/CHIXMA

28.

Kotz

S. A.

Meyer

Alter

Besson

von Cramon

D.Y.

Friederici

A. D.

(2003). On the lateralization of emotional prosody: An event-related functional MR investigation. Brain and Language, 86(3), 366–376. https://doi.org/10.1016/S0093-934X(02)00532-1

29.

Kotz

S. A.

Paulmann

(2007). When emotional prosody and semantics dance cheek to cheek: ERP evidence. Brain Research, 1151(1), 107–118. https://doi.org/10.1016/j.brainres.2007.03.015

30.

Kret

M. E.

Roelofs

Stekelenburg

J. J.

de Gelder

(2013). Emotional signals from faces, bodies and scenes influence observers’ face expressions, fixations and pupil-size. Frontiers in Human Neuroscience, 7, 810. https://doi.org/10.3389/fnhum.2013.00810

31.

Kret

M. E.

Sjak-Shie

E. E.

(2019). Preprocessing pupil size data: Guidelines and code. Behavior Research Methods, 51(3), 1336–1342. https://doi.org/10.3758/s13428-018-1075-y

32.

Kuchinke

Schneider

Kotz

S. A.

Jacobs

A. M.

(2011). Spontaneous but not explicit processing of positive sentences impaired in Asperger’s syndrome: Pupillometric evidence. Neuropsychologia, 49(3), 331–338. https://doi.org/10.1016/j.neuropsychologia.2010.12.026

33.

Kuznetsova

Brockhoff

P. B.

Christensen

R. H. B.

(2017). Lmertest package: Tests in linear mixed effects models. Journal of Statistical Software, 82(13), 1–26. https://doi.org/10.18637/JSS.V082.I13

34.

Lausen

Hammerschmidt

(2020). Emotion recognition and confidence ratings predicted by vocal stimulus type and prosodic parameters. Humanities and Social Sciences Communications, 7(1), 1–17. https://doi.org/10.1057/s41599-020-0499-z

35.

Leys

Ley

Klein

Bernard

Licata

(2013). Detecting outliers : Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 49(4), 764–766. https://doi.org/10.1016/j.jesp.2013.03.013

36.

Lima

C. F.

Anikin

Monteiro

A. C.

Scott

S. K.

Castro

S. L.

(2019). Automaticity in the recognition of nonverbal emotional vocalizations. Emotion. Lima, César F.: Centro de Investigação e Intervenção Social, Instituto Universitário de Lisboa (ISCTE-IUL), Avenida das Forças Armadas, Lisboa, Portugal, 1649-026, cesar.lima@iscte-iul.pt: American Psychological Association. https://doi.org/10.1037/emo0000429

37.

Mathôt

(2018). Pupillometry: Psychology, physiology, and function. Journal of Cognition, 1(1), 1–23. https://doi.org/10.5334/joc.18

38.

Mathôt

Fabius

Van Heusden

Van der Stigchel

(2018). Safe and sensible preprocessing and baseline correction of pupil-size data. Behavior Research Methods, 50(1), 94–106. https://doi.org/10.3758/s13428-017-1007-2

39.

Mcdougal

D. H.

Gamlin

P. D.

(2015). Autonomic control of the eye. Comprehensive Physiology, 5(1), 439–473. https://doi.org/10.1002/cphy.c140014.Autonomic

40.

McGarrigle

Dawes

Stewart

A. J.

Kuchinsky

S. E.

Munro

K. J.

(2017). Pupillometry reveals changes in physiological arousal during a sustained listening task. Psychophysiology, 54(2), 193–203. https://doi.org/10.1111/psyp.12772

41.

Mirman

(2014). Growth curve analysis and visualization using R. Growth Curve Analysis and Visualization Using R. https://doi.org/10.1201/9781315373218

42.

Nagels

Gaudrain

Vickers

Lopes

M. M.

Hendriks

(2020). Development of vocal emotion recognition in school-age children : The EmoHI test for hearing-impaired populations. PeerJ, e8773(8), 1–14. https://doi.org/10.7717/peerj.8773

43.

Neagu

M. B.

Kressner

A. A.

Relaño-Iborra

Bækgaard

Dau

Wendt

(2023 Feb 22). Investigating the reliability of pupillometry as a measure of individualized listening effort. Trends in Hearing, 27, 23312165231153288. https://doi.org/10.1177/23312165231153288. PMCID: PMC9947699

44.

Niikuni

Wang

Makuuchi

Koizumi

Kiyama

(2022). Pupil dilation reflects emotional arousal via poetic language. Perceptual and Motor Skills, 129(6), 1691–1708. https://doi.org/10.1177/00315125221126778

45.

Oliva

Anikin

(2018). Pupil dilation reflects the time course of emotion recognition in human vocalizations. Scientific Reports, 8, 4871. https://doi.org/10.1038/s41598-018-23265-x

46.

Paquette

Peretz

Belin

(2013). The “musical emotional bursts”: A validated set of musical affect bursts to investigate auditory affective processing. Frontiers in Psychology, 4, 1–7. https://doi.org/10.3389/fpsyg.2013.00509

47.

Parbery-Clark

Anderson

Kraus

(2013). Musicians change their tune: How hearing loss alters the neural code. Hearing Research, 302, 121–131. https://doi.org/10.1016/j.heares.2013.03.009

48.

Parsons

C. E.

Young

K. S.

Craske

M. G.

Stein

A. L.

Kringelbach

M. L.

(2014). Introducing the Oxford Vocal (OxVoc) Sounds database: A validated set of non-acted affective sounds from human infants, adults, and domestic animals. Frontiers in Psychology, 5, 562. https://doi.org/10.3389/fpsyg.2014.00562

49.

Partala

Jokiniemi

Surakka

(2000). Pupillary responses to emotionally provocative stimuli. In Eye tracking research & application: Proceedings of the 2000 symposium on eye tracking research and applications. (pp. 123–129). ACMPress.

50.

Partala

Surakka

(2003). Pupil size variation as an indication of affective processing. International Journal of Human Computer Studies, 59(1–2), 185–198. https://doi.org/10.1016/S1071-5819(03)00017-X

51.

Pell

M. D.

Rothermich

Liu

Paulmann

Sethi

Rigoulot

(2015). Preferential decoding of emotion from human non-linguistic vocalizations versus speech prosody. Biological Psychology, 111, 14–25. https://doi.org/10.1016/j.biopsycho.2015.08.008

52.

Posner

Russell

J. A.

Peterson

B. S.

(2005). The circumplex model of affect: An integrative approach to affective neuroscience, cognitive development, and psychopathology. Developmental Psychopathology, 17(3), 715–734. https://doi.org/10.1037/a0017745.The

53.

Russell

J. A.

(1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178. https://doi.org/10.1037/h0077714

54.

Sauter

Eimer

(2010). Rapid detection of emotion from human vocalizations. Journal of Cognitive Neuroscience, 22(3), 474–481. https://doi.org/10.1162/jocn.2009.21215

55.

Snowden

R. J.

O’Farrell

K. R.

Burley

Erichsen

J. T.

Newton

N. V.

Gray

N. S.

(2016). The pupil’s response to affective pictures: Role of image duration, habituation, and viewing mode. Psychophysiology, 53(8), 1217–1223. https://doi.org/10.1111/psyp.12668

56.

Sonntag

G. P.

Portele

(1998). PURR - A method for prosody evaluation and investigation. Computer Speech and Language, 12(4), 437–451. https://doi.org/10.1006/csla.1998.0107

57.

van Rij

Hendriks

van Rijn

Baayen

R. H.

Wood

S. N.

(2019). Analyzing the time course of pupillometric data. Trends in Hearing, 23, 2331216519832483. https://doi.org/10.1177/2331216519832483

58.

Wagner

A. E.

Nagels

Toffanin

Opie

J. M.

Başkent

(2019). Individual variations in effort: Assessing pupillometry for the hearing impaired. Trends in Hearing, 23, 1–18. https://doi.org/10.1177/2331216519845596

59.

Weiss

M. W.

Trehub

S. E.

Schellenberg

E. G.

Habashi

(2016). Pupils dilate for vocal or familiar music. Journal of Experimental Psychology: Human Perception and Performance, 42(8), 1061–1065. https://doi.org/10.1037/xhp0000226

60.

Yang

Makita

Nakao

, et al. (2018). Affective auditory stimulus database: An expanded version of the international affective digitized sounds (IADS-E). Behavior Research, 50, 1415–1429. https://doi.org/10.3758/s13428-018-1027-6

61.

Yildirim

Bulut

Lee

C. M.

Kazemzadeh

Deng

Lee

Narayanan

Busso

(2004). An acoustic study of emotions expressed in speech. Proceedings Interspeech 2004, 2193–2196. https://doi.org/10.21437/Interspeech.2004-242

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.16 MB