Abstract
This study investigates how auditory, visual, and lyrical features in music videos shape emotional responses, using EEG data from 26 participants. Five widely viewed music videos were selected based on their global popularity and cross-cultural appeal. A CNN–LSTM model classified emotional states with 97.67% accuracy, and complementary regression results showed strong generalization (best model RMSE = 0.15, MAE = 0.10). Feature selection reduced 27 candidates to a sparse set dominated by auditory and visual cues, and SHAP interpretation revealed a clear modality hierarchy: pitch- and dynamics-related auditory features accounted for the largest share of predictive importance, visual color properties (hue, saturation) provided secondary influence, and lyrical sentiment contributed least. These findings support neuroaesthetic accounts in which low-level sensory structure drives rapid affective appraisal, with visual tone refining emotional meaning. Practically, the results suggest that deliberate control of pitch/dynamics can reliably steer emotional engagement in music-video creation for diverse audiences.
Introduction
The relationship between art and emotional well-being has become a central focus in empirical aesthetics, cognitive science, and health psychology. Engagement with diverse art forms—visual, auditory, or multimodal—has consistently been linked to positive emotional states, improved cognitive functioning, and enhanced psychological well-being (Susino et al., 2025). Art offers a unique mode of emotional engagement that supports mood regulation and can alleviate stress and anxiety (Castellotti et al., 2025; Hunter et al., 2010; Mastandrea et al., 2019). Music, in particular, evokes strong emotional responses that correspond with physiological indicators such as heart rate variability and EEG patterns, which serve as biomarkers of emotional health (Schreuder et al., 2016; Thompson & Quinto, 2012).
Recent work in multimodal emotion recognition (MER) shows that the integration of auditory, visual, and lyrical cues—especially in audiovisual formats like music videos—produces richer and more complex emotional experiences than any single modality alone (Baltrusaitis et al., 2019; Han et al., 2022). Because emotional engagement with such multimodal stimuli contributes to both immediate affective responses and longer-term psychological benefits, understanding the mechanisms through which art induces emotion is essential for advancing research on art and well-being.
Despite this progress, significant gaps remain. Empirical studies seldom isolate how specific low-level features of artistic experiences—such as pitch, dynamics, hue, or textual sentiment—shape emotional and cognitive outcomes. Existing models often focus on broad emotional categories or rely on static self-reports, which fail to capture the dynamic nature of emotional responses (Kim & Provost, 2016) and can result in inconsistencies between human judgments and machine predictions (Muszynski et al., 2021; Pandeya et al., 2021b). Recent findings highlight the importance of capturing real-time fluctuations in emotional salience, as continuous emotional engagement appears critical for fostering long-term psychological resilience (Liu et al., 2026; Schreuder et al., 2016).
To address these gaps, the present research investigates the cognitive and emotional impact of art within an empirical aesthetics framework, using music videos as a naturalistic, multimodal stimulus. Although art clearly shapes emotional experience, the pathways linking emotional engagement to cognitive and psychological well-being remain insufficiently understood. This study bridges that gap by integrating neurophysiological data (EEG) with computational analyses of auditory, visual, and textual features to identify which elements most strongly induce emotion and how emotional salience shifts over time across modalities. EEG's high temporal resolution enables fine-grained tracking of dynamic emotional states, providing new insights into how music videos evoke emotion in real-world contexts. This multimodal, temporally sensitive approach advances the development of modality-aware MER models and offers implications for the therapeutic use of art, the design of emotionally supportive media environments, and the creation of affect-sensitive technologies that better adapt to users’ emotional experiences.
Literature Review
Perceived and Induced Emotions
Music videos are a powerful medium for conveying and eliciting emotional experiences, which is a key factor behind their broad appeal (Kallinen & Ravaja, 2006). In affective science, a distinction is made between perceived emotions—those expressed or signaled by the music video—and induced emotions, which refer to the actual emotional states experienced by the audience (Fan et al., 2017). This distinction is crucial for understanding the emotional processing of music videos, as it helps differentiate between the emotional tone of the content and the emotional responses triggered by the viewer's engagement.
Perceived emotions are typically inferred from multimodal cues, such as auditory features (e.g., tempo, intensity, pitch), visual elements (e.g., facial expressions, gestures, scene dynamics), and semantic/textual content like lyrics. In contrast, induced emotions are measured through physiological responses (e.g., EEG, heart rate variability) and behavioral indicators, which reflect the internal emotional states triggered by the stimuli (Tsai et al., 2015). While self-reports are commonly used to measure induced emotions, they can be biased and fail to capture the fluctuating emotional responses during video consumption (Soleymani et al., 2012). EEG-based approaches, however, offer a promising alternative by providing high-resolution measures of emotional states that directly correlate with audiovisual stimuli (Jenkins et al., 2009). With its high temporal resolution, EEG is particularly well-suited for studying the dynamic changes in emotional responses to stimuli (Alarcão & Fonseca, 2019).
Although perceived and induced emotions are often related, their relationship is complex. While perceived emotions generally have a higher magnitude, changes in perceived emotions tend to strongly correlate with changes in induced emotions (Thompson & Quinto, 2012). For instance, music with different emotional valences can influence EEG activity and align with perceived emotional states (Plourde-Kelly et al., 2021). However, other studies indicate that perceived and induced emotions may diverge. For example, Dibben (2004) found that the perceived emotional tone of music remained stable across contexts, even as induced emotions fluctuated. This highlights the dynamic and context-sensitive nature of emotional induction, suggesting that temporal, personal, and situational factors influence induced emotional responses, which may not always align with the perceived emotional content (Kallinen & Ravaja, 2006).
In the broader context of emotional computing and sentiment analysis, this variability underscores the need to improve the precision of sentiment recognition and address the so-called semantic gap between media content and user affective experience. Recent research emphasizes the value of incorporating user features, such as emotional predispositions captured through EEG or eye-tracking, into models of image and video sentiment recognition to enhance interpretive accuracy and real-world applicability (Liang et al., 2024). These approaches help bridge the divide between content-level analysis and the lived emotional experiences of users, offering a more personalized and dynamic understanding of emotion.
To model these complex emotional dynamics more effectively, scholars have increasingly adopted dimensional models of affect, which conceptualize emotions along continuous scales. Two key dimensions—valence (positive to negative) and arousal (high to low activation)—are widely used in the study of music and media-induced emotion (Cespedes-Guevara & Eerola, 2018; Wang et al., 2024a). This study adopts Russell's circumplex model of affect as its analytical framework, scoring emotional responses along these two axes. Guided by this model, we investigate how distinct unimodal features—auditory, visual, and semantic—contribute to induced emotional experiences. In doing so, this research addresses a critical gap in existing literature, which has predominantly focused on perceived emotional cues, by emphasizing the dynamic and individualized nature of induced emotional states during real-world multimedia engagement.
Multimodal Emotion Recognition
Emotions elicited by music videos are inherently multimodal, arising from the complex interplay of auditory, visual, and textual stimuli. Recent advancements in multimodal emotion recognition (MER) have capitalized on this integration, improving the accuracy of emotional predictions by employing architectures such as late fusion, hybrid fusion, residual-fusion models, or more sophisticated approaches like multi-stage fusion networks designed for modality collaboration (Li & Zhao, 2023). These approaches extract distinct features from sound, visuals, and lyrics and subsequently combine them to estimate emotional intensity and valence (Baltrusaitis et al., 2019; Han et al., 2022).
Within this domain, two core challenges emerge: feature extraction and cross-modal fusion. In terms of feature extraction, a wide range of deep learning techniques have been applied, including Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) networks, Gated Recurrent Units (GRU), Convolutional Neural Networks (CNN), and attention mechanisms—all of which aim to capture the dynamic and hierarchical structure of multimodal emotional signals (Wang et al., 2024a). For cross-modal fusion, integration strategies are typically classified into three categories: feature-level fusion, decision-level fusion, and hybrid fusion. In most frameworks, unimodal features are first extracted independently and then combined to generate a unified emotional prediction (Han et al., 2022). Empirical findings consistently demonstrate that multimodal models significantly outperform unimodal systems in affective recognition tasks, highlighting the value of integrating complementary cues across modalities (Rouast et al., 2021).
Modality-Specific Contributions
Research across modalities suggests that each sensory channel contributes uniquely to emotional perception. In the auditory modality, emotional cues are conveyed through features such as timbre, rhythm, pitch, and dynamics (Fu et al., 2011; Zhang, 2021). These features can strongly influence perceived arousal and emotional intensity. For instance, Geringer et al. (1996) found that while the addition of visuals slightly enhanced emotional engagement, audio alone exerted a stronger influence on the overall affective experience.
In the visual modality, emotion is communicated through facial expressions, movements (scene-change speed), and visual aesthetics such as lighting and color (Lee et al., 2017). Facial expressions, in particular, provide immediate emotional signals—such as a smile indicating happiness or wide eyes signaling surprise (Bhattacharya et al., 2021). Body movements and gestures have also been shown to modulate the perceived emotional tone of accompanying music (Krahé et al., 2015; Vines et al., 2011; Vuoskoski et al., 2016). These visual cues often serve to amplify or reshape the emotional interpretations initiated by audio or text.
The textual modality, especially lyrics, plays a distinct role by conveying higher-order semantics and emotional meaning. Lyrics can anchor the narrative of a music video, frame emotional context, and influence perceived valence (Wang et al., 2024b). In multimodal sentiment analysis, text is often found to be the most predictive modality, especially for valence estimation (Poria et al., 2023). Supporting this, Lavy (2001) demonstrated that supplementary textual information, such as a short story presented alongside music, could alter the listener's emotional evaluation of the auditory experience. This highlights the complementary role of lyrics in augmenting the emotional content conveyed by sound and visuals.
Modeling Multimodal Emotional Integration
Despite the growing body of research in multimodal emotion recognition, most existing models prioritize predictive accuracy or rely on fusion strategies that often obscure the distinct contributions of individual modalities. Such approaches offer limited insight into the underlying mechanisms of emotional processing, as they tend to treat modalities as interchangeable or additive rather than examining how each uniquely contributes to the emotional experience. Emerging evidence, however, underscores the persistence of modality-specific effects even in integrated contexts. For example, Bhattacharya et al. (2021) demonstrated that auditory features (e.g., in A and A + V conditions) are more effective in predicting arousal, while textual features (e.g., T, A + T, V + T) better predict valence. These findings suggest that emotional information is encoded and conveyed differently across modalities, challenging assumptions of uniform integration.
EEG-based studies have provided valuable insights into how the brain differentially processes emotional cues across sensory channels. Desai et al. (2024) found that neural response patterns during audiovisual presentations closely resembled those elicited under unimodal conditions, indicating that the brain integrates multimodal information while retaining modality-specific representations. However, they also noted a tendency for visual evoked potentials (such as early P1 or N1 components, or specific occipital alpha/beta desynchronization patterns) to dominate in EEG recordings, potentially masking the contributions of auditory or semantic inputs, like narrative understanding beyond direct lyrical content (Desai et al., 2024). This dominance of visual information is consistent with earlier foundational findings. Mehrabian and Ferris (1967) reported that, in emotion communication, visual cues were approximately 1.5 times more influential than auditory ones. Subsequent studies have further confirmed this asymmetry. Schreuder et al. (2016) observed that in audiovisual contexts, emotionally congruent visual stimuli enhanced positive affective appraisals more effectively than audio alone. Similarly, Coutinho and Scherer (2017) reported strong correlations between unimodal and multimodal evaluations, emphasizing the weight of visual information in shaping emotional judgments during performance assessment.
Previous research on emotional processing has primarily focused on modality-level dynamics, often relying on controlled experimental stimuli. This approach overlooks the potential of computational modeling using rich, real-world data. Moreover, much of the existing literature examines emotions at a broader modality level (e.g., auditory, visual, textual), neglecting the finer-grained feature-level interactions within each modality. To address these gaps, this study employs naturalistic EEG data and computational modeling to explore emotional integration in a more detailed and ecologically valid manner. The key objectives of this research are to examine how auditory, visual, and textual features contribute to emotional responses in naturalistic settings (such as music video consumption) and to explore the dynamic interactions of feature-level cues from these modalities and their influence on emotional states, as measured by EEG. By addressing these objectives, the study aims to provide deeper insights into how multimodal cues shape emotional experiences and offer a more ecologically valid framework for multimodal emotion recognition.
Method
Material Preparation
Five high-quality music videos (MVs)—Let It Go, Sugar, See You Again, Stay, and Because of You—were selected from YouTube, Bilibili, and TikTok based on clearly defined empirical and methodological criteria. First, we chose videos with exceptionally high global view counts, as popularity is a reliable proxy for cultural reach, emotional resonance, and familiarity, all of which strongly influence the robustness and consistency of affective responses in empirical studies (Liikkanen & Salovaara, 2015; Schubert, 2004). Second, we selected MVs that demonstrated substantial cross-platform presence and audience diversity, ensuring cross-cultural validity and reducing the likelihood that emotional reactions would be driven by culturally specific cues (Lee et al., 2024; Susino et al., 2025). Such widely recognized stimuli are commonly recommended in affective neuroscience and empirical aesthetics because they enhance comparability across participants and support stable neural and behavioral responses (Koelsch, 2014).
Beyond popularity and cultural reach, these MVs were chosen because they exhibit clear narrative structure, strong audiovisual integration, and emotionally expressive musical features, all of which are known to reliably elicit affective engagement without overwhelming cognitive load (Liao et al., 2020). Their multimodal richness—combining music, lyrics, and visual storytelling—provides the necessary complexity for studying multimodal emotional processing while maintaining ecological validity. Collectively, these characteristics make the selected MVs well aligned with the study's goal of examining how auditory, visual, and lyrical features jointly shape emotional responses.
To provide a clearer understanding of the experimental workflow, a flowchart is included to visually represent the sequence of steps involved in the study. This flowchart outlines the process from the selection of music videos, through the extraction of multimodal features, to the analysis of the emotional responses (see Figure 1).

Workflow for feature selection, extraction, analysis, and explanation.
Material Multimodal Analysis
To comprehensively investigate emotional processing in music videos, a multimodal analysis framework was adopted, incorporating auditory, visual, and textual information. Auditory features were extracted to capture emotional cues conveyed through sound characteristics such as pitch, timbre, rhythm, and dynamics, which are known to influence perceived arousal and emotional intensity. Rather than relying on low-level features obtained through signal processing techniques like Fourier transform, spectral or cepstral analysis, and autoregressive modeling—methods that are not directly related to the intrinsic properties of music as perceived by human listeners (Fu et al., 2011; Song et al., 2012)—this study focused on features with semantic content. These high-level features provide meaningful insights that better explain the emotional impact of music due to their closer alignment with human understanding (Fu et al., 2011; Song et al., 2012). Visual features were analyzed to assess emotional signals transmitted through color composition, movement dynamics, and facial expressions, reflecting the visual modality's role in shaping and amplifying emotional engagement. Additionally, the lyrical content of each music video was subjected to sentiment analysis to evaluate the semantic and affective dimensions conveyed through text, recognizing its distinct contribution to emotional meaning and valence perception. This multimodal approach allowed for a comprehensive assessment of how sensory and semantic information interact to shape emotional experiences during music video viewing (Table 1).
Extracted Features for Audio, Visual, and Lyrics of the Current Study.
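To make the feature families listed in Table 1 concrete, the sketch below illustrates how such descriptors can be computed with common open-source tools (librosa for audio, OpenCV for frames, VADER for lyric sentiment). These libraries, the file names, and the specific descriptors are illustrative assumptions rather than the authors' exact pipeline.

```python
# Hedged sketch of per-modality feature extraction; tools and file names are placeholders.
import cv2
import librosa
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Auditory: pitch/harmony and dynamics descriptors for one MV soundtrack
y, sr = librosa.load("mv_audio.wav", sr=22050)
hpcp_like = librosa.feature.chroma_cqt(y=y, sr=sr).mean(axis=1)           # harmonic pitch-class profile analogue
rms_energy = librosa.feature.rms(y=y).mean()                              # dynamics (RMS energy)
spectral_contrast = librosa.feature.spectral_contrast(y=y, sr=sr).mean()

# Visual: mean hue and saturation of a sampled frame
cap = cv2.VideoCapture("mv_video.mp4")
ok, frame = cap.read()
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
hue, saturation = hsv[..., 0].mean(), hsv[..., 1].mean()

# Lyrics: sentence-level sentiment score
sentiment = SentimentIntensityAnalyzer().polarity_scores("Let it go, let it go ...")["compound"]
```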
EEG Emotion Analysis
Grounded in contemporary cognitive neuroscience research, this study explores the relationship between brain activity and emotional processing during music video exposure. Electroencephalography (EEG) offers a reliable and efficient method for capturing emotional responses, providing high temporal resolution and sensitivity to neural dynamics underlying affective states (Papez, 1937).
Data Source and Preprocessing
EEG training data were sourced from the SEED dataset (Zheng et al., 2019), specifically designed to study neural correlates of emotion during multimedia exposure. Standard feature extraction methods were applied to segment the EEG signals into canonical frequency bands—delta, theta, alpha, beta, and gamma—each associated with distinct cognitive and emotional processes (Papez, 1937; Tian et al., 2020; Zheng et al., 2019). Frequency-domain features were obtained via power spectral density (PSD) analysis, while time-domain features were captured using the short-time Fourier transform (STFT) (Chen et al., 2019). Additionally, differential entropy features were extracted to characterize the nonlinear dynamics of the EEG signal (Du et al., 2021).
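As an illustration of the band-limited PSD and differential-entropy features described here, the following sketch computes both per window and per channel, assuming the common Gaussian approximation DE = 0.5 · log(2πeσ²) used in SEED-style pipelines; the window length, filter order, and nperseg value are assumptions.

```python
# Sketch of band-limited PSD and differential-entropy (DE) features for one EEG window.
import numpy as np
from scipy.signal import welch, butter, sosfiltfilt

BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 14),
         "beta": (14, 30), "gamma": (30, 50)}

def band_features(segment, fs=1000):
    """segment: 1-D EEG window (e.g., 0.5 s) from a single channel."""
    freqs, psd = welch(segment, fs=fs, nperseg=min(len(segment), 256))
    feats = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        feats[f"psd_{name}"] = psd[mask].mean()                              # frequency-domain feature
        sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        x = sosfiltfilt(sos, segment)                                        # band-limited signal
        feats[f"de_{name}"] = 0.5 * np.log(2 * np.pi * np.e * np.var(x))     # differential entropy (Gaussian approx.)
    return feats
```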
During preprocessing, standard filtering and Independent Component Analysis (ICA) were applied to remove artifacts, followed by segmentation of the EEG signals into temporal windows (e.g., 0.5 s) to capture dynamic changes. These segments were then used for frequency band energy extraction via Fourier transform. The signals were decomposed into five frequency bands—delta (0.5–4 Hz), theta (4–8 Hz), alpha (8–14 Hz), beta (14–30 Hz), and gamma (30–50 Hz)—each associated with specific emotional or cognitive states, such as alpha for relaxation and beta and gamma for heightened arousal or attention. To improve cross-subject generalization, domain adaptation strategies and data augmentation techniques, such as time-frequency perturbations and pseudo-labeling, were applied. These methods preserved physiological signal characteristics while enhancing model robustness.
Following feature extraction, a feature matrix was constructed, and two additional preprocessing steps were applied: a Kalman filter for temporal smoothing and StandardScaler normalization to standardize feature distributions. These steps ensured appropriate feature scaling and minimized noise prior to model training. This comprehensive pipeline enabled the SEED dataset to provide high-quality EEG data for emotion recognition and serve as a benchmark for deep learning applications in EEG-based affective computing.
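A minimal sketch of these two post-extraction steps (temporal smoothing and standardization) is given below; the scalar random-walk Kalman filter, its noise variances, and the placeholder feature_matrix are illustrative stand-ins rather than the authors' exact implementation.

```python
# Minimal sketch: Kalman smoothing of each feature over time, then StandardScaler normalization.
# feature_matrix is a placeholder array of shape (time_windows, features).
import numpy as np
from sklearn.preprocessing import StandardScaler

def kalman_smooth(series, process_var=1e-3, meas_var=1e-1):
    """Smooth one feature's time course with a 1-D random-walk Kalman filter."""
    x, p = float(series[0]), 1.0
    out = np.empty(len(series))
    for t, z in enumerate(series):
        p += process_var                 # predict step
        k = p / (p + meas_var)           # Kalman gain
        x += k * (z - x)                 # update step
        p *= (1.0 - k)
        out[t] = x
    return out

smoothed = np.apply_along_axis(kalman_smooth, 0, feature_matrix)   # smooth each feature column over time
scaled = StandardScaler().fit_transform(smoothed)                  # zero mean, unit variance per feature
```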
Participants and Experimental Design
Twenty-six native Chinese speakers (19 females, 7 males; mean age = 20.65 years, SD = 1.77) participated in the experiment. All participants were undergraduate students recruited from a major university in western China. They reported no history of neurological or psychiatric disorders, were right-handed, and had normal or corrected-to-normal vision. In addition to basic demographic information, participants were asked to report their prior exposure to music videos and general familiarity with popular music. Because the selected stimuli were globally circulated and widely recognized, and the participant group belonged to a relatively homogeneous age cohort, overall familiarity with this type of content was ensured at the group level. However, no formal assessment of artistic training or expertise was collected, as the study focused on naturalistic emotional responses rather than expertise-related differences. Likewise, no psychometric questionnaires (e.g., anxiety, depression, personality scales) were administered. This decision was made to minimize participant burden and to avoid introducing additional variability unrelated to the primary aim of examining multimodal feature contributions to emotion; however, we acknowledge that psychological traits may modulate emotional responses and address this as a limitation in Section 5.2.
Prior to participation, written informed consent was obtained in accordance with ethical guidelines approved by the university's ethics committee. Participants were seated 135 cm from a monitor and wore headphones during the experiment. Music videos were presented in randomized order to counterbalance potential order effects. Participants first completed a demographic questionnaire and consent form before beginning the task. Upon completion of the session, participants were thanked and compensated with approximately 10 USD.
EEG data were recorded using a SynAmps2 amplifier (NeuroScan, Charlotte, NC, USA) with a 64-channel Ag/AgCl electrode cap arranged according to the extended 10–20 international system. The ground electrode was placed at AFz, and recordings were referenced online to the nasal tip, with offline re-referencing to the averaged mastoid electrodes (M1, M2). Vertical electrooculogram (VEOG) signals were collected using bipolar electrodes placed above and below the left eye to monitor ocular artifacts. Data were sampled at 1000 Hz with 24-bit resolution and a bandwidth of 0.03–100 Hz.
To ensure high-quality signal acquisition and facilitate emotion recognition, the recording setup and preprocessing procedures were informed by the protocols established in the SEED dataset. Bandpass filtering was applied within the 0.5–50 Hz range to eliminate low-frequency drift and high-frequency noise, retaining the frequency components most relevant to cognitive and affective processes. ICA was subsequently employed to identify and remove artefacts related to ocular and muscular activity, thus enhancing signal purity.
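A hedged MNE-Python sketch of this acquisition-side preprocessing (mastoid re-referencing, 0.5–50 Hz band-pass, ICA artefact removal) is shown below; the file name, channel labels, component count, and excluded components are placeholders, not the authors' exact settings.

```python
# Hedged MNE-Python sketch of the recording-side preprocessing steps.
import mne

raw = mne.io.read_raw_cnt("sub01.cnt", preload=True)        # NeuroScan recording (placeholder file)
raw.set_eeg_reference(["M1", "M2"])                          # offline re-reference to averaged mastoids
raw.filter(l_freq=0.5, h_freq=50.0)                          # band-pass 0.5–50 Hz

ica = mne.preprocessing.ICA(n_components=20, random_state=0)
ica.fit(raw)
ica.exclude = [0, 1]                                         # ocular/muscular components chosen by inspection
ica.apply(raw)
```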
In alignment with SEED preprocessing methods, the EEG data were segmented into temporal windows and decomposed into five canonical frequency bands—delta (0.5–4 Hz), theta (4–8 Hz), alpha (8–14 Hz), beta (14–30 Hz), and gamma (30–50 Hz)—each of which has well-established associations with emotional and cognitive states. Delta activity reflects motivational and affective intensity, theta is associated with emotional arousal and memory-related processing, alpha suppression indicates increased emotional engagement and attentional demands, beta reflects cognitive evaluation and emotional intensity, and gamma is implicated in the integration of complex emotional information (Aftanas & Golocheikine, 2001; Codispoti et al., 2023; Knyazev, 2012). These theoretical and empirical links justify the selection of these bands for emotion recognition.
Channel-wise z-score normalization was performed to reduce inter-individual variability and support model generalization. Although auxiliary data such as eye-tracking and behavioral responses were recorded, this study focuses exclusively on EEG signals to explore the direct neural correlates of emotion. The rigorous acquisition and preprocessing procedures ensure that the extracted features are both physiologically meaningful and suitable for machine learning-based affective state classification.
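Channel-wise z-scoring can be expressed compactly as in the sketch below; the (channels × samples) array layout is an assumption.

```python
# Channel-wise z-score normalization within a recording.
import numpy as np

def zscore_channels(eeg):
    """eeg: array of shape (n_channels, n_samples)."""
    mu = eeg.mean(axis=1, keepdims=True)
    sd = eeg.std(axis=1, keepdims=True) + 1e-12   # guard against flat channels
    return (eeg - mu) / sd
```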
Model Construction
To predict emotional responses from EEG, we developed a custom CNN-LSTM model for integrated spatial and temporal feature extraction. This hybrid architecture leverages the complementary strengths of Convolutional Neural Networks (CNNs), which extract high-level, spatially invariant features from the frequency-band inputs, and Long Short-Term Memory (LSTM) networks, which capture the temporal dependencies inherent in dynamic emotional processing (Alhagry et al., 2017; Chuankun et al., 2017). The input EEG passes through two CNN blocks: the first comprises a Conv1D layer (filters = 64, kernel size = 5, ReLU activation), MaxPooling1D (pool size = 2), and Dropout (0.5); the second comprises a Conv1D layer (filters = 128, kernel size = 3, ReLU activation), MaxPooling1D (pool size = 2), and Dropout (0.5). ReLU activations introduce non-linearity while mitigating the vanishing-gradient problem, facilitating the learning of complex patterns (Nair & Hinton, 2010). The CNN output feeds an LSTM layer (units = 128, dropout = 0.3, return_sequences = False). Finally, a Dense layer (units = 64, ReLU activation), a Dropout layer (0.5), and a Dense output layer (units = 1, linear activation) produce the prediction. The model was trained with the Adam optimizer on an 80%/20% train/validation split. Hyperparameters (detailed in Appendix A) were optimized via grid search on the validation set, and overfitting was mitigated through early stopping and the dropout layers specified above (0.3, 0.5).
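The following Keras sketch mirrors the architecture described above; the input shape, MSE loss, epoch count, and early-stopping patience are assumptions where the text does not specify them.

```python
# Keras sketch of the CNN-LSTM described in the text; unspecified settings are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(n_timesteps, n_features):
    model = models.Sequential([
        layers.Input(shape=(n_timesteps, n_features)),
        # CNN block 1
        layers.Conv1D(64, kernel_size=5, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.5),
        # CNN block 2
        layers.Conv1D(128, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.5),
        # Temporal modelling
        layers.LSTM(128, dropout=0.3, return_sequences=False),
        # Regression head
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

model = build_cnn_lstm(n_timesteps=200, n_features=310)   # e.g., 62 channels x 5 bands (illustrative shape)
early_stop = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```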
To assess the effectiveness of this hybrid architecture, we conducted ablation experiments comparing the full CNN + LSTM model against several baselines, including CNN-only, LSTM-only, and traditional machine learning models such as Support Vector Machines (SVM). The results, summarized in Table 2, show that our CNN + LSTM model achieved a precision of 97.67%, outperforming both its individual components and previously reported models. This demonstrates the efficacy of combining spatial and temporal modeling techniques for EEG-based emotion recognition.
Comparison of Different Models for EEG Emotion Score Prediction.
Analysis Process
The analysis proceeded in four stages: (1) descriptive analysis of all multimodal features and EEG emotion scores; (2) LASSO regression to reduce the 27 candidate predictors to a sparse set of key features; (3) training and comparison of multiple regression models for predicting EEG emotion scores; and (4) SHAP-based interpretation of the best-performing model. Each stage is reported in the corresponding subsection of the Results.
Results
Descriptive Analysis
To provide an overview of the dataset characteristics, a descriptive analysis was conducted on all key variables. This analysis included measures of central tendency (mean) and variability (standard deviation) for each multimodal feature and the corresponding emotional response scores (See Table 3).
Descriptive Analysis of all Features in the Current Study.
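A short pandas sketch of this summary is shown below, assuming a hypothetical DataFrame df with one numeric column per multimodal feature plus the EEG emotion score.

```python
# Pandas sketch of the descriptive summary (mean and SD per variable); df is a placeholder DataFrame.
summary = df.agg(["mean", "std"]).T.rename(columns={"mean": "Mean", "std": "SD"})
print(summary.round(2))
```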
Feature Selection
LASSO regression analysis was performed on all independent variables, with the EEG emotion score as the dependent variable (Figure 2). LASSO, which shrinks variable coefficients to prevent overfitting and mitigate multicollinearity, identified the optimal model at the λ yielding the minimum mean squared error (log λ ≈ −3.3). As a result, the 27 independent variables were reduced to a sparse set of key predictors: Harmonic Pitch Class Profile (HPCP), RMS Energy (RE), Low Energy (LOW), Hue (HUE), Spectral Contrast (SCO), Event Density (ED), Pitch Contour (PC), Loudness (LOU), and Saturation (SAT).

LASSO regression analysis was employed to select characteristic factors. (A) Ten-fold cross-validation was conducted to determine the optimal penalty parameter λ, with vertical lines indicating the selected values. The optimal λ resulted in 7 nonzero coefficients. (B) Coefficient profiles of the 27 features were plotted against the log(λ) sequence. Vertical dotted lines indicate the λ corresponding to the minimum mean squared error (MSE) (λ ≈ 0.035) and the λ one standard error from the minimum (λ ≈ 0.106).
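A minimal sketch of this selection step using scikit-learn's LassoCV with ten-fold cross-validation follows; the placeholder arrays features (27 columns), feature_names, and eeg_score stand in for the study's data.

```python
# Hedged sketch of LASSO feature selection with 10-fold cross-validation.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(features)                # standardize the 27 candidate predictors
lasso = LassoCV(cv=10, random_state=0).fit(X, eeg_score)

print("optimal alpha:", lasso.alpha_, "| log(alpha):", np.log(lasso.alpha_))
selected = [n for n, c in zip(feature_names, lasso.coef_) if c != 0]
print("retained predictors:", selected)
```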
Modelling Analysis
Multiple regression models—including LSTM, 1D CNN, XGBoost, MLP, Random Forest (RF), Bayesian Ridge Regression, Gradient Boosting Regressor, Support Vector Regressor (SVR), CatBoost, and Transformer-based architectures—were trained and evaluated. Each model was run 10 times to ensure result stability, and performance was assessed using root mean square error (RMSE) and mean absolute error (MAE). As shown in Figure 3, XGBoost (RMSE = 0.15, MAE = 0.10), 1D CNN (RMSE = 0.15, MAE = 0.11), LSTM (RMSE = 0.15, MAE = 0.11), and MLP (RMSE = 0.15, MAE = 0.11) demonstrated superior predictive performance relative to the other models across both the training and testing sets.

Training and testing performance of the four top-performing machine learning models in predicting EEG emotion scores. Red dots represent the training set, and blue dots represent the testing set: (A) 1D CNN, (B) LSTM, (C) XGBoost, and (D) MLP.
Among these, XGBoost achieved the lowest RMSE and MAE, indicating its strong ability to generalize and accurately capture the nonlinear relationships inherent in the EEG emotion prediction task. This advantage is likely due to XGBoost's robustness to multicollinearity, its regularization mechanisms that prevent overfitting, and its ability to efficiently model complex interactions between features. Consequently, XGBoost was identified as the optimal model for subsequent analysis.
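An illustrative sketch of the evaluation protocol for this best-performing model is given below (repeated train/test splits, RMSE and MAE on held-out data); the XGBoost hyperparameters and the 80/20 split are assumptions, and X_sel and y are placeholders for the LASSO-selected features and EEG emotion scores.

```python
# Illustrative evaluation sketch: repeated splits, held-out RMSE and MAE for XGBoost.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from xgboost import XGBRegressor

rmses, maes = [], []
for seed in range(10):                                   # 10 repetitions for stability
    X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.2, random_state=seed)
    reg = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05,
                       subsample=0.8, random_state=seed)
    reg.fit(X_tr, y_tr)
    pred = reg.predict(X_te)
    rmses.append(np.sqrt(mean_squared_error(y_te, pred)))
    maes.append(mean_absolute_error(y_te, pred))

print(f"test RMSE = {np.mean(rmses):.2f}, test MAE = {np.mean(maes):.2f}")
```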
Model Interpretation
To interpret the model's prediction of EEG emotion scores, we applied the SHAP framework. SHAP provided insights into both global feature importance and individual prediction contributions. The global importance ranking (Figure 4B) showed that Harmonic Pitch Class Profile (HPCP), Root Mean Square (RMS) Energy, and Low Energy were the most influential features, followed by Hue, Spectral Contrast, Event Density, and Pitch Contour. The SHAP summary plot (Figure 4A) demonstrated that higher values of HPCP, RMS Energy, and Low Energy generally increased predicted EEG emotion scores, while lower values of Spectral Contrast and Pitch Contour were similarly associated with higher predictions. At the individual level, SHAP force plots (Figure 4C) illustrated how specific features contributed to a single prediction. For example, in one instance, Pitch Contour, Spectral Contrast, Low Energy, and RMS Energy increased the prediction relative to the base value, whereas HPCP, Event Density, and Hue decreased it. These results enhance the model's interpretability and confirm the relevance of multimodal features in predicting EEG-based emotional responses.

SHAP was employed to interpret the model's predictions. (A) The SHAP summary plot displays the contribution of each feature, where each line represents a feature and the x-axis denotes the SHAP value. Red dots indicate higher feature values, while blue dots represent lower feature values. (B) The SHAP feature importance plot ranks features according to their overall contribution to the model's predictions. (C) The SHAP force plot illustrates how individual features influence a specific prediction.
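A minimal SHAP sketch that reproduces the three views in Figure 4 for the fitted XGBoost model is shown below; reg, X_sel, and feature_names are the placeholders introduced in the evaluation sketch above.

```python
# Minimal SHAP sketch: summary plot, global importance ranking, and a single force plot.
import shap

explainer = shap.TreeExplainer(reg)
shap_values = explainer.shap_values(X_sel)

shap.summary_plot(shap_values, X_sel, feature_names=feature_names)                    # cf. Figure 4A
shap.summary_plot(shap_values, X_sel, feature_names=feature_names, plot_type="bar")   # cf. Figure 4B
shap.force_plot(explainer.expected_value, shap_values[0], X_sel[0],
                feature_names=feature_names, matplotlib=True)                         # cf. Figure 4C
```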
Discussion
This study reveals the significant roles of auditory, visual, and lyrical modalities in shaping emotional responses during music video consumption. Among these modalities, auditory features emerged as the most influential in shaping emotional states, followed by visual features, with lyrical content exerting the least impact. This hierarchical pattern aligns with aesthetic theories proposing that low-level perceptual features—such as pitch, timbre, hue, and saturation—serve as primary drivers of emotional induction, shaping affect before higher-level cognitive interpretation occurs (Brattico & Pearce, 2013; Leder et al., 2004). Recent work further demonstrates that low-level artistic features—such as color temperature, brightness, and saturation—shape affective judgments through rapid perceptual pathways (Elliot & Maier, 2014). These findings support our EEG-based evidence that early, sensory-driven mechanisms play a central role in the formation of aesthetic emotions.
The prominence of the auditory modality in emotion induction aligns with existing research highlighting the emotional significance of sound in music. In music videos, the auditory channel provides continuous, structured emotional information that primarily influences affective judgments. This supports the channel dominance model (Somsaman, 2004), which suggests that viewers align their emotional perceptions with the dominant modality when modalities convey emotionally coherent cues. Lee et al. (2017) demonstrated that strong arousal signals from the auditory channel outweigh visual input in emotional appraisals, leading to sustained attentional focus and stable emotional responses. These findings underscore that music—through pitch and dynamics—serves as a primary emotional conduit, engaging both low-level sensory and higher-order affective processes (Koelsch, 2014; Schubert, 2004).
Auditory features, particularly pitch and dynamics, were the most significant predictors of emotional states in this study. The negative correlations of Pitch Class Profile (PCP) and Harmonic Pitch Class Profile (HPCP) with EEG-measured emotional responses contrast with Ilie and Thompson's (2011) findings, which linked higher pitch in speech and lower pitch in music to positive emotions. Our findings indicate that higher-pitched musical elements—particularly those associated with dissonance or tension—tend to evoke negative emotional reactions, consistent with recent work on harmonic roughness and emotional disfluency in music (Athanasopoulos et al., 2021; Giannos et al., 2025) and with contemporary auditory aesthetics research showing that harmonic roughness and spectral dissonance reliably modulate arousal and valence, often eliciting tension or negative affect (Smit et al., 2019). This discrepancy may reflect the complex interaction between dissonant pitch content and other sensory modalities, which together create multifaceted emotional experiences.
Similarly, the relationship between dynamics and emotional responses was nuanced. Low-energy sounds, characterized by softer, subdued musical elements, were positively associated with emotional states linked to calmness, in line with findings that connect low-energy sounds to positive emotions (Juslin & Västfjäll, 2008). In contrast, higher Root Mean Square (RMS) Energy, indicating intense or abrupt musical elements, was negatively correlated with emotional responses, suggesting that high energy can induce overstimulation or anxiety, particularly when the viewer's emotional state does not align with the music's intensity (Brattico et al., 2013; Huron, 2005). These results underscore the complexity of emotional processing in music videos, where the emotional impact of dynamics depends on both energy levels and the viewer's emotional state.
Although rhyme and timbre had a lesser impact on emotional responses compared to pitch and dynamics, their influence should not be overlooked. Rhyme enhances emotional resonance by reinforcing the emotional content in music (Mayer & Rauber, 2010). Similarly, timbre—conveying tonal qualities like warmth or sharpness—plays a key role in evoking distinct emotional states (Juslin & Laukka, 2003). Moreover, emotional information from one modality can influence emotional processing in another, especially when cues from one modality are ambiguous. This cross-modal interaction highlights the complexity of emotional processing, demonstrating that even subtle features like rhyme and timbre contribute to the overall emotional experience (Rigoulot & Pell, 2012).
Surprisingly, in the visual modality, hue (H) and saturation (S) emerged as the most influential features, rather than facial expressions or scene movement. While facial expressions were not the primary focus of this study, prior research suggests that even unconscious recognition of facial cues can influence emotional processing (de Gelder et al., 2002). In this study, lower saturation colors were associated with more positive emotional reactions, consistent with research suggesting that different colors evoke distinct emotional responses. For example, lighter or more saturated colors are often linked to happiness (Wright & Rainwater, 1962), while blue hues evoke calmness and red is associated with excitement (Hevner, 1935). These color-emotion associations are consistent across cultures (D’Andrade & Egan, 1974). This outcome aligns with emerging neuroaesthetic research showing that color properties influence affective processing through early visual pathways, modulating arousal and valence before complex semantic interpretation occurs (Palmer & Schloss, 2010). Our EEG-linked findings extend this work by demonstrating that color features meaningfully shape neural emotion responses during dynamic audiovisual experiences, supporting theories that emphasize visual tone as a foundational component of aesthetic emotion.
Overall, our findings resonate with and extend prominent neuroaesthetic frameworks—such as the Leder et al. (2004) model and the Brattico and Pearce (2013) model—which posit that aesthetic emotions emerge from interactions between early perceptual analysis and later cognitive appraisal. The strong predictive power of low-level auditory and visual features in EEG emotion responses suggests that aesthetic experience in music videos is primarily driven by fast, sensory-perceptual mechanisms, with higher-order meaning (e.g., lyrics) exerting only secondary influence.
In summary, this study provides valuable insights into the hierarchical and complex nature of multimodal emotional processing. Auditory features, particularly pitch and dynamics, dominate emotional induction, while color properties in the visual modality also significantly influence emotional responses. The interaction between modalities, especially when emotional cues are ambiguous, further enriches our understanding of how viewers process and respond emotionally to multimedia content.
Implications and Contributions
Theoretically, this study advances the understanding of multimodal emotional processing by establishing a validated hierarchy of sensory modalities—auditory > visual > lyrical—in shaping emotional responses to music videos. By integrating neurophysiological data (EEG) with audiovisual feature analysis, the research anchors emotional responses in both perceptual input and affective processing systems. Crucially, the findings demonstrate that specific low-level features—such as pitch dissonance, RMS energy, and color attributes like hue and saturation—differentially predict emotional outcomes. These results refine existing frameworks, such as the channel dominance model, and challenge reductive assumptions that associate high pitch or vivid visuals uniformly with positive affect. Notably, the strong influence of hue and saturation on emotional responses highlights the critical role of visual tone, an often-overlooked component in emotion-related studies of music video content, which typically prioritize narrative or facial expressions.
This study also distinguishes itself from prior research that often isolates a single modality for analysis. By adopting a multimodal approach and capturing interactions among auditory, visual, and lyrical cues, it offers a more ecologically valid account of how emotions are experienced in real-world multimedia environments. The inclusion of EEG-based evidence further strengthens the contribution by linking subjective emotional outcomes to measurable neural correlates, thus bridging psychological theory and affective neuroscience.
Practically, the findings hold valuable implications across several applied domains. In music, film, and digital media production, the demonstrated influence of pitch, dynamics, and visual color attributes offers actionable insights for shaping audience emotional engagement. Content creators and editors can deliberately manipulate auditory intensity or adjust visual tones to align with the intended emotional trajectory of a scene, enhancing narrative coherence and affective impact. In the realm of affective computing, where technologies aim to recognize and adapt to users’ emotional states, prioritizing auditory and color-based features may improve system sensitivity and responsiveness—particularly in recommendation engines or personalized user interfaces.
In therapeutic and educational contexts, this research supports the targeted use of low-energy soundscapes and calming visual palettes to promote emotional regulation. For instance, music therapists might employ subdued auditory and visual stimuli to reduce anxiety or enhance mood stability in clinical settings. Similarly, educators could apply these insights in designing emotionally supportive learning environments. The findings also inform user experience (UX) and interface design, suggesting that sensory alignment across modalities can contribute to more immersive, emotionally attuned digital interactions. In marketing and branding, understanding how specific sensory features evoke particular affective responses—such as excitement, calmness, or nostalgia—can enhance emotional resonance with target audiences and increase engagement.
Ultimately, this study offers a nuanced and empirically grounded framework for designing emotionally intelligent multimedia experiences. By revealing how individuals process and respond to the interplay of auditory, visual, and lyrical cues, it contributes both to the refinement of theoretical models and to the development of practical tools for emotion-driven design and communication.
Limitations and Future Directions
Several limitations should be noted. First, different modalities—such as auditory and visual channels—express emotions in distinct ways, with interactions that may lead to redundant or complementary information exchange. This complexity can influence emotional perception across various multimedia formats (Liu et al., 2023). Second, the relatively homogeneous sample of native Chinese-speaking undergraduate students may limit generalizability. Although the selected music videos possess broad cross-cultural appeal, future work should recruit more diverse participants to examine how factors such as cultural background, music preference, and emotional predispositions shape multimodal emotional processing. Third, Large Language Models (LLMs) were not employed in the current analysis due to inherent limitations in their design and application to EEG-based emotion recognition. LLMs are pretrained primarily on large-scale text corpora and lack the native capacity to interpret raw biological time-series data such as EEG (Lee et al., 2024; Wang et al., 2024). Even with structured preprocessing, their performance in cross-modal tasks remains unstable, prone to reasoning biases and output variability (Liu et al., 2026). Furthermore, fine-tuning LLMs on small, individualized EEG datasets is computationally demanding and susceptible to overfitting, rendering them impractical for this study's scale and design (Lee et al., 2024; Wang et al., 2024).
Finally, the study did not include psychometric assessments of participants’ psychological traits, such as mood, personality, emotion regulation tendencies, or baseline anxiety. Although this decision minimized participant burden and maintained focus on neural and multimodal features, individual psychological dispositions are known to modulate subjective and physiological emotional responses (Gross & John, 2003; Larsen & Ketelaar, 1991) and can alter neural dynamics during emotion processing (Moser et al., 2013). The absence of these measures may introduce unexplained variance in EEG responses. Future research should incorporate standardized instruments (e.g., PANAS, BFI, ERQ) to better account for individual differences and clarify how stable psychological traits interact with artistic features to shape emotional responses.
Conclusion
This study underscores the significant contributions of different sensory modalities to emotional responses during music video consumption. Among the auditory, visual, and lyrical modalities, the auditory channel emerged as the most influential in shaping emotional engagement, particularly through pitch and dynamics. The visual modality also played a crucial role, with color features—especially hue and saturation—significantly impacting emotional responses. In contrast, lyrical content had a relatively minor effect. These findings align with the channel dominance model, emphasizing the primacy of auditory input in conveying emotional cues when modalities present coherent emotional information. Moreover, the interaction between auditory and visual features highlights the complex, multimodal nature of emotional processing, where low-level sensory cues, such as pitch and color properties, significantly influence emotional states. The study also reveals that even less prominent features, like rhyme and timbre, contribute to the overall emotional experience.
In the context of Information Systems research, which often examines the comparison between different formats of information presentation, this study addresses the simultaneous perception of auditory and visual information in emotional processing. This research contributes to understanding how the human perceptual and cognitive systems process emotional information from multimodal sources. These insights provide a foundation for future investigations into information systems usage and offer valuable guidelines for the design of multimedia information systems. By ensuring that emotional cues are effectively conveyed and processed across different channels, the findings can aid in creating more engaging and emotionally resonant multimedia experiences.
Footnotes
Ethical Approval and Informed Consent Statements
N/A
Funding
This work was supported by the Sichuan Province Philosophy and Social Sciences Fund Youth Talent Project (SCJJ25QN19), the HKBU Start-Up Grant, and the HKBU FASS Start-Up Research Fund.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
