Abstract
Music in three-dimensional (3D) audio formats is becoming increasingly important in many areas of the entertainment industry. However, little research has been done on the effects of various playback formats on the emotional listening experience. A study by Hahn made an important contribution to this topic. Based on a repeated measures design and using the Geneva Emotional Music Scale (GEMS), he conducted a listening experiment comparing music experiences resulting from presentations in stereo, 5.1 surround sound, and Auro-3D 9.1 (a 3D audio format containing a 5.1 surround sound layer and four height channels) reproduced by loudspeakers. Data were made available for a reanalysis. The main aims of this study were (1) analyzing listening differences between formats as measured by the original GEMS factors, (2) calculating effect sizes for a better estimation of sample sizes for future studies, and (3) making the data set available to the public. For the reanalysis, the ratings of participants were aggregated (mean values for the nine GEMS factors per audio format). There were significant differences between the formats as shown by a nonparametric MANOVA (N = 52) with the GEMS factors as dependent variables and the three audio formats as a repeated measures factor. For the GEMS factor Transcendence, an ANOVA (N = 52) revealed a large omnibus effect (ηp2 = .206) for the three formats. Pairwise contrasts showed a significant increase (small to medium effect size) in emotional experiences for the Transcendence factor from stereo to surround sound (Cohen's dZ = 0.31), surround sound to 3D audio (Cohen's dZ = 0.45), and stereo to 3D audio (Cohen's dZ = 0.64).
Since the release of the Blu-ray Disc in 2006, the three-dimensional reproduction of sound (3D audio) has become increasingly important for the domains of film, virtual reality, video entertainment, and computer games. The increased storage capacity of the Blu-ray Disc enabled the use of multi-channel immersive audio based on the new standards of Auro-3D (released in 2006), Dolby Atmos (released in 2012), and DTS:X (released in 2015). Over recent years, 3D audio formats (mostly reproduced by headphone binauralizations) have also become increasingly popular in the music industry and are often promoted for their assumed greater “emotional depth” compared to standard stereo reproduction (Strauß, 2020, p. 18). Although “emotional depth” is neither an established concept nor a construct, it is used to postulate that emotions felt when listening to music in 3D audio formats are more intense than those felt when listening to the same music in stereo format. The main difference between the aforementioned playback formats is the number of playback channels (mono < stereo < 5.1 surround sound < 3D audio), and the spatial impression of sound can be enhanced by a higher number of channels: While stereo loudspeakers can create virtual sound sources on the line between them (Geluso, 2018), a 5.1 surround sound layout is able to extend the virtual sound sources to the horizontal plane around the listener (Kim, 2018). To further enhance the spatial impression to three dimensions by elevating sound sources, three or more height channel speakers are required (Kim, 2018). In this context, an important question is whether this increasing spatiality of sounds has an objective influence on the emotional responses of listeners. Previous studies dealt with the evaluation of multi-channel stimuli, sound systems, and spatial audio in general (Francombe, Brookes & Mason, 2017; Francombe, Brookes, Mason, & Woodcock, 2017; Rumsey, 1998; Zacharov & Pedersen, 2015).
However, these studies focused mainly on the quality of sound reproduction, subjective attribution of properties, and listener preferences. No conclusions on emotional effects can be drawn from these findings. As a precondition for an evaluation of the listeners’ emotional experiences, a valid psychometric inventory for the measurement of immersive music experience is crucial. The Geneva Emotional Music Scale (GEMS; Zentner et al., 2008) is designed to measure how music makes a participant feel. To do so, participants use a five-point Likert scale to rate the extent to which the emotions they feel match an adjective. This is done using a variety of adjectives, which constitute the GEMS items (for example, “sad” or “happy”). Using factor analytical techniques, Zentner et al. (2008) derived a model with nine first-order factors, each comprising some adjective items without overlap. Because of intercorrelations between first-order factors, the model also includes three second-order factors. To the best of our knowledge, Hahn (2018) conducted the first controlled study measuring the emotional experiences of listeners in reaction to different playback formats. Using a German 27-item version of GEMS (see Appendix Table A1), Hahn (2018) measured the emotions evoked by loudspeaker playback in stereo, 5.1 surround sound, and Auro-3D 9.1—a 3D audio channel-based format consisting of a 5.1 surround sound layer and four height channels. In a complete repeated measures (RM) design, 53 participants—all in individual sessions—listened to two excerpts from Schönberg's string sextet “Verklärte Nacht” (Transfigured Night) Op. 4 in each of the three audio formats via loudspeaker and indicated the intensity of emotions felt on five-point Likert scales for each stimulus. The stimuli had durations of 74 s and 97 s and were presented in random order. The length of these stimuli exceeded the average time of 8.31 s required for an emotional judgement as reported by Bachorik et al. (2009). 
Using the GEMS single items as a basis for data analysis, Hahn (2018) only found slight differences in emotional experience between the three audio formats, which did not reach significance. Furthermore, he extracted three principal components from the GEMS items, which are similar to the second-order GEMS factors. Based on these three components, he reported no significant differences between the three audio formats. However, Hahn did not analyze the first-order factors of GEMS.
Our main aims for the data reanalysis were:
(1) maintenance of the original first-order GEMS factor structure for the reanalysis (in contrast to Hahn's [2018] analyses of the individual GEMS items and of three principal components extracted from the items, which target the second-order GEMS factors);
(2) calculation of effect sizes for a better estimation of the influence of playback formats on the emotional experience, which will be of particular interest for power calculations in future studies;
(3) sustainable use of the data set by making it available to the public.
Method
For the original study by Hahn (2018), participants provided informed written consent for inclusion, collection, use, and publication of data.
Filtering the Data Set
Hahn's (2018) data set consists of N = 53 participants. Because participants were allowed to omit responses, there are missing values on the GEMS items in the data set. Since some of the scheduled analyses “must not contain missing values” (Friedrich et al., 2022, multRM documentation), we had to decide how to handle missing values when calculating the scores from the items for the GEMS factors. As a compromise between keeping the majority of the original sample and basing aggregated scores on a reasonable number of individual values, the following exclusion criterion was defined: For our analyses, a participant was considered a valid case when a response was given on at least two items for each of the GEMS factors per audio excerpt. In other words, a participant was excluded if there was at least one excerpt in which fewer than two items of any one factor were answered. In line with this criterion, one participant had to be excluded. The remaining N = 52 participants had an age range between 16 and 69 years (M = 30.13, SD = 12.46). Seventeen participants (32.7%) indicated their gender as female, and 35 (67.3%) as male. Thirty-eight out of 52 (73.1%) reported a music-related profession or course of study (predominantly sound engineers [Tonmeister] and musicians). Most of the participants stated that they listened to classical music regularly (n = 30, 57.7%) or occasionally (n = 13, 25.0%). Thirty-three participants (63.5%) were familiar with the composition used as stimulus material and 7 (3.6%) of these reported knowing it very well. Regarding 3D audio, 19 participants (36.5%) indicated that they had heard or read about it, and an additional 25 (48.1%) that they had already listened to music in a 3D audio format.
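The exclusion criterion can be illustrated with a minimal Python sketch. The data layout (one dictionary entry per rated excerpt presentation, with `None` marking an omitted response) and all names are assumptions made for illustration; they are not taken from Hahn's (2018) data set or from the reanalysis scripts.

```python
from typing import Dict, List, Optional

# Hypothetical layout: responses[presentation][factor] is a list of item
# ratings (1 to 5), with None marking an omitted response.
Responses = Dict[str, Dict[str, List[Optional[int]]]]

MIN_ITEMS_PER_FACTOR = 2  # criterion stated in the text


def is_valid_case(responses: Responses) -> bool:
    """Return True if every factor has at least two answered items
    in every rated presentation."""
    for factors in responses.values():
        for ratings in factors.values():
            answered = sum(rating is not None for rating in ratings)
            if answered < MIN_ITEMS_PER_FACTOR:
                return False
    return True


# Illustrative participants (made-up ratings, two factors only).
complete = {
    "excerpt_1": {"Transcendence": [3, 4, 2], "Wonder": [5, None, 4]},
    "excerpt_2": {"Transcendence": [2, 3, 3], "Wonder": [4, 4, 4]},
}
too_sparse = {
    "excerpt_1": {"Transcendence": [3, None, None], "Wonder": [5, 4, 4]},
    "excerpt_2": {"Transcendence": [2, 3, 3], "Wonder": [4, 4, 4]},
}

print(is_valid_case(complete))
print(is_valid_case(too_sparse))
```

In this sketch, a single omitted item per factor is tolerated, whereas a factor with only one answered item in any presentation leads to exclusion of the participant.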
Scores for GEMS Factors
In contrast to Hahn's (2018) excerpt-based data analysis, ratings were aggregated over the two musical excerpts for the reanalysis since there was no hypothesis on a difference between the emotional impact of the two excerpts. As a result, each participant should have one score for each of the first-order GEMS factors in all three audio formats. In the first step, item responses were averaged across the two excerpts for each format. In the second step, the factor score per format was calculated as the mean of the corresponding items. Because of the filter criterion, each factor score was based on at least four item responses (at least two per excerpt).
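The two aggregation steps can be sketched as follows for one participant, one factor, and one audio format. How omitted responses entered the averages is not described in the text; the sketch assumes that an item omitted in one excerpt is represented by its remaining rating and that an item omitted in both excerpts is dropped. Function names are hypothetical.

```python
from statistics import mean
from typing import List, Optional


def item_means_across_excerpts(
    excerpt_1: List[Optional[int]], excerpt_2: List[Optional[int]]
) -> List[float]:
    """Step 1: average each item over the excerpts in which it was answered."""
    item_means = []
    for first, second in zip(excerpt_1, excerpt_2):
        answered = [r for r in (first, second) if r is not None]
        if answered:  # assumption: items omitted in both excerpts are dropped
            item_means.append(mean(answered))
    return item_means


def factor_score(
    excerpt_1: List[Optional[int]], excerpt_2: List[Optional[int]]
) -> float:
    """Step 2: the factor score is the mean of the item means."""
    return mean(item_means_across_excerpts(excerpt_1, excerpt_2))


# Made-up ratings for the items of one factor in one format.
score = factor_score([3, 4, None], [2, 4, 5])
print(round(score, 2))
```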
Data Analysis
Due to the experimental design, an RM multivariate analysis of variance (MANOVA) was applied with the three audio formats as a within-subjects factor and the GEMS factors as dependent variables. Classical MANOVA and related procedures are based on assumptions that are often not met in real data (Bathke et al., 2018; Friedrich et al., 2019). Distribution-related assumptions include multivariate normality and the absence of multivariate outliers. In the filtered data set, multivariate normality was not present for the stereo and 3D audio conditions as indicated by Mardia's test (Mardia, 1970) implemented in the MVN R package (Korkmaz et al., 2014, 2021). Furthermore, 29 participants were multivariate outliers in at least one condition based on robust Mahalanobis distances (Korkmaz et al., 2014, 2021). However, since a non-normal distribution of the responses could validly represent their underlying mechanisms and there were no distributional assumptions based on any hypothesis, exclusion of these outliers was not considered. In addition, the exclusion would have substantially reduced the data set. To circumvent issues resulting from violated theoretical preconditions (e.g., inflated Type I error rates), we used a nonparametric method with minimal assumptions regarding the data (R package MANOVA.RM, see Friedrich et al., 2019, 2022). This approach offers a Wald-type statistic and a modified ANOVA-type statistic (MATS) with different bootstrap methods for testing MANOVA models. Based on a large simulation study, Friedrich and Pauly (2018) recommend MATS in combination with parametric bootstrapping. Its wild [sic] bootstrap approach is very liberal, and its nonparametric bootstrap tends to be more conservative than the parametric version. MATS also appears to be more robust than the Wald-type statistic. Therefore, we used MATS in combination with parametric bootstrapping.
A common, but also debatable, approach following a significant MANOVA result would be to conduct individual univariate ANOVAs on each of the dependent variables (Denis, 2015, Chapter 12; Field et al., 2012, Chapter 16.5.3; Rencher & Christensen, 2012, Chapter 6). Some authors argue that a significant MANOVA protects against alpha inflation when conducting the individual ANOVAs. Others argue that this is only partially the case and therefore suggest correction of the alpha level for the univariate tests (Field et al., 2012). However, since we are interested in the largest possible effect on emotions felt that the respective audio formats can have, we will focus on only one ANOVA. The GEMS factor that showed the largest differences in emotional experiences between the audio formats was Transcendence, and was therefore considered for analysis. The ANOVA model that predicts the Transcendence score from the audio format taking into account the RM structure of the data can be formulated using common R syntax as follows:
Finally, correlations—as an additional effect size—between the audio conditions were calculated for the GEMS factor Transcendence. A mean correlation was obtained by averaging the individual Fisher z-transformed correlations and back-transforming the result to rz, as this value is less biased compared to the mean of untransformed correlations for this sample size (Corey et al., 1998).
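The averaging procedure can be sketched in Python using only the standard library. As input, the sketch uses the three IMEI correlations listed in Appendix Table A3 (.789, .675, .873); the function name is hypothetical. For comparison, the plain arithmetic mean of the untransformed correlations is computed as well.

```python
import math
from statistics import mean
from typing import List, Tuple


def mean_correlation_fisher(correlations: List[float]) -> Tuple[float, float]:
    """Average correlations via Fisher's z and back-transform the mean.

    Returns the mean z value and the back-transformed mean correlation r_z.
    """
    z_values = [math.atanh(r) for r in correlations]  # Fisher z-transform
    mean_z = mean(z_values)
    return mean_z, math.tanh(mean_z)  # tanh inverts the z-transform


# Correlations between IMEI scores as listed in Appendix Table A3.
imei_correlations = [0.789, 0.675, 0.873]

mean_z, r_z = mean_correlation_fisher(imei_correlations)
plain_mean = mean(imei_correlations)

print(f"mean z = {mean_z:.3f}")
print(f"back-transformed r_z = {r_z:.2f}")
print(f"plain mean of r = {plain_mean:.3f}")
```

Because the z-transform stretches large correlations more than small ones, the back-transformed mean is slightly larger than the plain mean of the untransformed values.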
Results
The RM MANOVA revealed significant overall differences between the formats, MATS = 17.90, p = .004 (based on parametric bootstrapping). As can be seen in Figure 1 and Table 1, the differences between the formats are largest for the Transcendence factor. Since the data for this factor fulfilled the theoretical assumptions (no extreme outliers, normality, and sphericity), a standard RM ANOVA was used, resulting in a large omnibus effect of ηp2 = .206, F(2, 102) = 13.209, p < .001, ηG2 = .044 (generalized). As can be surmised from Figure 2, pairwise contrasts for Stereo < Surround, Stereo < 3D audio, and Surround < 3D audio were also significant (all p < .05; for details, see Table 2). The respective effect sizes ranged from dZ = 0.31 to 0.64 and bias-adjusted from g = 0.24 to 0.50. CLE ranged from 62.22% to 74.05%. Besides pairwise comparisons, McGraw and Wong (1992) suggest a formula to estimate the CLE for one condition compared to several other conditions. According to their approach, the probability that a participant scored higher in 3D audio compared to both stereo and surround sound is 54.99%.
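Two of the reported effect sizes can be cross-checked from the summary statistics alone. The sketch below assumes the usual identity between partial eta squared and the F ratio, and it assumes that the pairwise CLE for within-subjects data is the standard normal probability of dZ, as described by Lakens (2013). Function names are hypothetical, and small deviations from the reported CLE values are expected because dZ is used here in its rounded form.

```python
from statistics import NormalDist


def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def cle_from_dz(d_z: float) -> float:
    """Common language effect size for paired data, assuming CLE = Phi(d_z)."""
    return NormalDist().cdf(d_z)


# Omnibus test for Transcendence as reported: F(2, 102) = 13.209.
eta_p2 = partial_eta_squared(13.209, 2, 102)
print(f"partial eta squared = {eta_p2:.3f}")

# Smallest and largest pairwise effect sizes as reported (rounded d_Z values).
for d_z in (0.31, 0.64):
    print(f"d_Z = {d_z:.2f} -> CLE = {cle_from_dz(d_z):.1%}")
```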

Figure 1. Means and confidence intervals of the nine GEMS factors for the three audio formats. Emotions felt are reported on a five-point Likert scale from 1 (Not at all) to 5 (Very much). Error bars represent 95% confidence intervals for within-subjects designs according to Cousineau and O’Brien (2014).

Figure 2. Error plot for the GEMS factor Transcendence. Transcendence is reported on a five-point Likert scale from 1 (Not at all) to 5 (Very much). Error bars represent 95% confidence intervals for within-subjects designs according to Cousineau and O’Brien (2014).
Table 1. Means and standard deviations of all nine GEMS factors for the three audio formats.
Note. Means across N = 52 participants. Standard deviations are presented in parentheses.
Table 2. Pairwise contrasts for the RM ANOVA on the GEMS factor Transcendence.
Note. SE = 0.081 and df = 102 taken from the ANOVA model for all contrasts; p values are Holm-adjusted; for details on the calculation of d and g with different subscripts, see Lakens (2013).
The scores for Transcendence in the three audio formats were positively correlated (for details see Table 3). Correlations ranged from r = .70 to .82. The averaged value resulting from the individual Fisher z-transformed correlations was 0.96. Back-transformation resulted in a value of rz = .75 for the mean correlation for the scores for Transcendence between the three audio formats.
Table 3. Correlations between the GEMS Transcendence scores in the three audio formats.
Note. N = 52. Alternative hypothesis is a positive correlation. zr = Fisher z-transform.
rz (back-transformed average zr value).
Looking at the individual ANOVAs for the remaining GEMS factors, Wonder, Nostalgia, Joyful Activation, and Power showed significant differences for the uncorrected significance level of α = .05 (for details see Appendix Table A2). With a Bonferroni correction resulting in αcorrected = α / 9 = .0056, the only factor showing significant differences besides Transcendence was Wonder. Its effect size is ηp2 = .102, while the effect sizes for the seven remaining GEMS factors are below ηp2 = .063.
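The corrected significance level can be reproduced with a few lines of Python. For comparison, the sketch also includes the Holm step-down adjustment, which is the adjustment named in the note on the pairwise contrasts. The p values passed to `holm_adjust` are made up for illustration and are not taken from the data; function names are hypothetical.

```python
from typing import List


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests


def holm_adjust(p_values: List[float]) -> List[float]:
    """Holm step-down adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Nine univariate ANOVAs, one per first-order GEMS factor.
print(f"corrected alpha = {bonferroni_alpha(0.05, 9):.4f}")

# Illustrative (made-up) p values for three pairwise contrasts.
print(holm_adjust([0.01, 0.04, 0.03]))
```

Holm's procedure is uniformly at least as powerful as the Bonferroni correction while controlling the same familywise error rate.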
Discussion
The MANOVA revealed that the audio formats had an effect on the emotions felt by the participants as measured by GEMS. For the factor of Transcendence, the ANOVA and its contrast analyses confirmed that the direction of emotional increase for the formats was as hypothesized (Stereo < Surround < 3D audio, see Figure 2 and Table 2). CLEs indicated that in a pairwise comparison the audio format with the technical possibility of higher spatiality is rated higher with a probability greater than 50%. This also holds for the comparison of 3D audio against both stereo and surround sound. Following common effect size benchmarks (Ellis, 2010, p. 41), the differences between the formats ranged from small effects (d or g ≥ 0.2) up to a medium effect (d or g ≥ 0.5). The large omnibus effect of ηp2 = .206 (which corresponds to Cohen's f = 0.509) or the generalized effect ηG2 = .044 (which corresponds to Cohen's f = 0.215), along with the average correlation of rz = .75 between the emotional ratings for the three audio formats, can be used as first estimates in a priori power analyses for future research designs.
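The correspondence between the eta squared values and Cohen's f follows from the standard conversion f = sqrt(η2 / (1 − η2)). A minimal Python sketch, with a hypothetical function name, reproduces both reported values:

```python
import math


def cohens_f(eta_squared: float) -> float:
    """Convert an eta squared value (partial or generalized) to Cohen's f."""
    if not 0.0 <= eta_squared < 1.0:
        raise ValueError("eta squared must lie in [0, 1)")
    return math.sqrt(eta_squared / (1.0 - eta_squared))


# Effect sizes reported for the Transcendence factor.
print(f"f from partial eta squared     = {cohens_f(0.206):.3f}")
print(f"f from generalized eta squared = {cohens_f(0.044):.3f}")
```

Which of the two f values is appropriate for a power analysis depends on the convention expected by the software used, so the choice should be checked against that tool's documentation.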
These calculated effect sizes have their limitations and should be used with caution. One limitation is that they are based on a stimulus set that is limited in at least two ways. First, participants only listened to two excerpts from a single piece of classical music. The results may differ for other pieces and other genres. In addition to the musical content itself, the recording and production techniques may also affect the differences between audio formats. Second, Auro-3D 9.1 is just one of many 3D audio formats, including, for example, higher-order ambisonics or object-based formats such as Dolby Atmos. A second limitation is that effect sizes were calculated for only one GEMS factor. GEMS aims to measure the emotions felt by listeners. Since these induced emotions are thought to be not only the result of the music but of a complex interaction between the music, the listener, and situational factors (Gabrielsson, 2001), the largest difference between the audio formats might not always apply to the factor of Transcendence. Since immersion is “characterized by […] increasing emotional involvement” (Grau, 2003, p. 13), research on immersion could be an indicator of the emotional effect of different audio formats. Against the background of more recent findings by Agrawal et al. (2022) from the audio-visual domain (the authors used excerpts from movies), there might be no significant difference between 3D audio and surround sound, or even no difference at all in the psychological experience of immersion and thus in the emotions felt.
Concerning the validity of our results, there is an overlap between the latent variables of Transcendence and Immersion, as measured by the Immersive Music Experience Inventory (IMEI; Wycisk et al., 2022), mediated through the common item “overwhelmed” in both inventories. Based on the outlier-adjusted data from Wycisk et al. (2022), which comprise evaluations by 190 participants of mono, stereo, and binaural 3D versions of audio excerpts from different pieces, a correlation analysis of the aggregated IMEI scores between audio formats resulted in a similar average correlation of rz = .79 (for details see Appendix Table A3). In addition, the CLE for 3D versus stereo and mono was 50.7%, which is close to the CLE for 3D audio from Hahn's (2018) data. Therefore, the effect sizes might be applicable in a broader context than the limitations imply at first sight.
Finally, the majority of Hahn's sample had a professional musical background. Thus, it might be assumed that findings could differ from those in the general population. However, at least for strong emotional experiences of music (in terms of physiological reactions such as chills), there is currently no evidence of differences between musicians and non-musicians. For example, Grewe et al. (2009) showed that the number of chills perceived is not linked to the level of music education, age, or gender. We cannot exclude a more general effect of musical sophistication on the psychological rating of emotional experiences, but this should be based on a more differentiated approach to musical skills as offered by the Goldsmiths Musical Sophistication Index (Müllensiefen et al., 2014).
To summarize, our data reanalysis not only offers a reliable overall effect size for future power calculations for the planning of perceptual studies on immersive listening experiences but also provides specific effect sizes for pairwise comparisons of audio formats. Future work on the emotional effect of audio formats should investigate a wider range of stimulus material, both in terms of the musical content and various 3D audio formats, and should address a more general audience.
Footnotes
Acknowledgment
We are indebted to Ephraim Hahn for making the original data available to us and giving permission for the publication of the data set.
Author Note
Portions of these findings were presented in a preliminary version as a poster at the 2022 Jahrestagung der Deutschen Gesellschaft für Musikpsychologie [Annual Conference of the German Society for Music Psychology], Würzburg, Germany.
Action Editor
Markus Neuwirth, Anton Bruckner Privatuniversität für Musik, Schauspiel und Tanz, Institut für Theorie und Geschichte, Linz, Austria.
Peer Review
Sarvesh Rajesh Agrawal, Bang and Olufsen, Research
One anonymous reviewer
Contributorship
KS and YW researched the literature. RK obtained the data and permission for reanalysis and publication. KS and RK were involved in data analysis. KS and YW wrote the first draft of the manuscript. All authors reviewed and edited the manuscript and approved the final version of the manuscript.
Data Availability
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical Approval
The original study by Hahn (2018) was conducted in Germany, where external ethical approval in psychological research is not mandatory and only required in specific cases. Such cases include (a) the expectation that participants take risks, (b) deliberately not informing participants about the study procedure, or (c) stimulating participants physically (Deutsche Forschungsgemeinschaft [DFG], 2023).
The presented reanalysis of the data set from the original study did not require ethics committee or IRB approval. The reanalysis did not involve the use of personal data, fieldwork, or experiments involving human or animal participants, or work with children, vulnerable individuals, or clinical populations.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by a research grant from “Niedersächsisches Vorab,” a joint program funded by the Volkswagen Foundation in conjunction with the Lower Saxony Ministry for Science and Culture (funding reference: ZN3497) awarded to the third author.
Appendix
Table A3. Correlations between the IMEI scores in mono, stereo, and binaural 3D audio.
| Formats | r | zr | p | 95% CI LL | 95% CI UL |
|---|---|---|---|---|---|
| Mono – Stereo | .789 | 1.068 | < .001 | 0.739 | 1.000 |
| Mono – 3D | .675 | 0.820 | < .001 | 0.605 | 1.000 |
| Stereo – 3D | .873 | 1.346 | < .001 | 0.841 | 1.000 |
| Average | .779 a | 1.078 | | | |
Note. N = 190. Alternative hypothesis is a positive correlation. zr = Fisher z-transform.
a rz (back-transformed average zr value).
