A Hierarchical Cross-modal Spatial Fusion Network for Multimodal Emotion Recognition
Abstract—Recent advancements in emotion recognition research based on physiological data have been notable. However, existing multimodal methods often overlook the interrelations between various modalities, such as video and electroencephalography (EEG) data, in emotion recognition. In this paper, a feature fusion-based hierarchical cross-modal spatial fusion network is proposed that effectively integrates EEG and video features. By designing an EEG feature extraction network based on 1D convolution and a video feature extraction network based on 3D convolution, the corresponding modality features are thoroughly extracted. To promote sufficient interaction between the two modalities, a hierarchical cross-modal coordinated attention module is proposed. Additionally, to enhance the network's perceptual ability for emotion-related features, a multi-scale spatial pyramid pooling module is designed. Meanwhile, a self-distillation method is introduced, which enhances performance while reducing the number of parameters in the network. The hierarchical cross-modal spatial fusion network achieved an accuracy of 97.78% on the valence-arousal dimension of the DEAP dataset and an accuracy of 60.59% on the MAHNOB-HCI dataset, reaching the state-of-the-art level.

Impact Statement—Emotion recognition is a multidisciplinary field of research that spans psychology, neuroscience, and artificial intelligence. It has significant implications for deepening our understanding of human emotions, improving human-computer interaction, and advancing mental health treatment. Our research on the Hierarchical Cross-Modal Spatial Fusion Network (HCSFNet) goes beyond traditional studies that focus solely on physiological data. By exploring the complex interactions between video and EEG data, HCSFNet optimizes emotion recognition performance and introduces a universally applicable approach. Its applications range from improving emotional state recognition in mental health patients to promoting intelligent driver assistance services and enhancing human-computer interaction in robotics and smart homes. This research represents a major advance with broad implications for academic understanding and practical applications.

Index Terms—Emotion recognition, multimodal, attention, self-distillation, cross-modal, DEAP

This work was supported by the Beijing Natural Science Foundation under grant L241016 and the National Natural Science Foundation of China under grants 62473223 and 62163012. (Corresponding author: Xiao He)
Ming Xu, Zeyi Liu, and Xiao He are with the Department of Automation, Tsinghua University, Beijing 100084, P. R. China (e-mails: [email protected], [email protected], [email protected]).
Tuo Shi and Hao Zhang are with Shenzhen ZNV Technology Co., Ltd, Shenzhen 518063, P. R. China (e-mails: [email protected], [email protected]).

I. INTRODUCTION

Emotions play an indispensable role in everyday life, significantly influencing individual psychological states, social activities, and decision-making processes [1], [2]. By analyzing various information sources such as facial expressions, voice tones, and textual content, artificial intelligence enables the inference and recognition of individual emotional states [3], [4]. The application range of emotion recognition is broad, encompassing areas such as safety control, healthcare, education, and child growth monitoring [5]–[7]. Electroencephalography (EEG) data and video data are commonly used sources for emotion recognition [8]. EEG data provide insights into emotional states by measuring the brain's electrical signals, while video data recognize emotions by analyzing facial expressions, eye movements, and head postures. Each data type has strengths and weaknesses: EEG data directly reflect internal feelings but can be susceptible to noise, individual differences, and equipment limitations [9]; video data, on the other hand, are more accessible and easier to process but can be affected by factors such as environmental lighting, facial obstructions, and feigned expressions [10], [11]. Therefore, effectively leveraging these two data types to enhance emotion recognition accuracy is a significant research focus.

In recent years, deep learning has emerged as the predominant research approach for emotion recognition [12]–[15]. Ozdemir et al. employed convolutional neural networks (CNNs) with multi-spectral topographic images to classify EEG signals, achieving an average accuracy of 88.38% on the DEAP dataset [16]. Yin et al. proposed a fusion network that combines generative adversarial networks (GANs) with long short-term memory (LSTM) networks, achieving an average accuracy of 90.53% on the same dataset [17]. Rajpoot et al. introduced an LSTM network with channel attention to capture intrinsic variables in EEG data and then extracted features using an attention-based CNN, resulting in an accuracy of 76.7% on the SEED dataset [18]. Zhang et al. employed a CNN-LSTM network to extract spatial features that represent functional relationships between EEG signals from different electrodes, achieving an accuracy of 94.17% on the DEAP dataset [3]. Guo et al. introduced a DCoT network that combines depth-wise convolution and a Transformer architecture, achieving an average accuracy of 93.83% on the SEED dataset [19].

As research progresses, unimodal methods in emotion recognition face challenges and limitations [20]. Emotion recognition based on physiological information may be hindered by the complexity and noise sensitivity of physiological signals, while image-based recognition might miss subtle changes in facial expression. Consequently, researchers are actively exploring multimodal methods, investigating how to combine various data sources for more comprehensive and
accurate emotion recognition [21]–[23]. Wang et al. used facial images, audio signals, and EEG for emotion recognition, achieving 71.24% accuracy on the SEED dataset [24]. Fu et al. proposed an MFFNN that fuses eye-tracking and EEG signals, achieving 87.32% accuracy on the SEED dataset [25]. Wu et al. developed a Bi-LSTM network for extracting and fusing features from video and EEG signals, reaching a 95.12% average accuracy on the DEAP dataset [26]. Lopez et al. proposed a hypercomplex multimodal network comprising parameterized hypercomplex multiplications, effectively combining EEG and peripheral physiological signals [27]. However, the multimodal data sources in these works are all physiological signals, and eye movement signals may not sufficiently express facial emotional states. Wang et al. proposed an end-to-end multimodal transformer framework that uses cross-modal fusion to integrate data from different modalities, prioritizing contextual information through self-attention mechanisms; however, the framework may encounter challenges in customization and adaptation for specific application scenarios [28]. A dual-stream heterogeneous graph recurrent neural network was also proposed for the fusion of multimodal physiological signals in emotion recognition [29], but its computational cost and model complexity could be substantial.

In this paper, a novel feature fusion-based network called the hierarchical cross-modal spatial fusion network (HCSFNet) is proposed, which consists of two key components: the Physiological Model Network (PMNet), a 1D convolution-based EEG feature extraction network, and the Video Model Network (VMNet), a 3D convolution-based video feature extraction network. To underscore the significance of pivotal features in multimodal fusion and to enhance the capture of intermodal correlations, we propose a hierarchical cross-modal collaborative attention module. We also introduce self-distillation to multimodal emotion recognition based on EEG and video data, so that HCSFNet requires fewer computational resources while achieving better accuracy.
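To make the two-branch design concrete, the sketch below outlines one possible PyTorch realization: a 1D-convolutional branch for multi-channel EEG segments, a 3D-convolutional branch for face-video clips, concatenation-based feature fusion, and a generic self-distillation loss in which a shallow auxiliary head is trained to mimic the main head. The layer widths, the plain concatenation (standing in for the hierarchical cross-modal attention and spatial pyramid pooling modules), and the names EEGBranch, VideoBranch, FusionNet, and self_distillation_loss are illustrative assumptions, not the exact PMNet/VMNet/HCSFNet architecture described later in the paper.

```python
# Illustrative sketch only: a minimal two-branch EEG/video fusion model in PyTorch,
# assuming DEAP-style inputs (32-channel EEG segments and short RGB face clips).
import torch
import torch.nn as nn
import torch.nn.functional as F


class EEGBranch(nn.Module):
    """1D-convolutional EEG feature extractor (PMNet-like placeholder)."""

    def __init__(self, in_channels: int = 32, feat_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, feat_dim, kernel_size=5, padding=2),
            nn.BatchNorm1d(feat_dim), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),           # collapse the time axis
        )

    def forward(self, x):                      # x: (B, 32, T)
        return self.conv(x).flatten(1)         # (B, feat_dim)


class VideoBranch(nn.Module):
    """3D-convolutional video feature extractor (VMNet-like placeholder)."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.Conv3d(32, feat_dim, kernel_size=3, padding=1),
            nn.BatchNorm3d(feat_dim), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),           # collapse time and space
        )

    def forward(self, x):                      # x: (B, 3, T, H, W)
        return self.conv(x).flatten(1)         # (B, feat_dim)


class FusionNet(nn.Module):
    """Concatenation fusion with a main head and a shallow auxiliary head."""

    def __init__(self, feat_dim: int = 128, num_classes: int = 2):
        super().__init__()
        self.eeg = EEGBranch(feat_dim=feat_dim)
        self.video = VideoBranch(feat_dim=feat_dim)
        self.aux_head = nn.Linear(feat_dim, num_classes)       # EEG features only
        self.main_head = nn.Linear(2 * feat_dim, num_classes)  # fused features

    def forward(self, eeg, video):
        f_e, f_v = self.eeg(eeg), self.video(video)
        return self.main_head(torch.cat([f_e, f_v], dim=1)), self.aux_head(f_e)


def self_distillation_loss(main_logits, aux_logits, target, alpha=0.3, temperature=2.0):
    """Generic self-distillation: the auxiliary head mimics the main head."""
    ce = F.cross_entropy(main_logits, target) + F.cross_entropy(aux_logits, target)
    kl = F.kl_div(
        F.log_softmax(aux_logits / temperature, dim=1),
        F.softmax(main_logits.detach() / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    return ce + alpha * kl
```

A typical training step would pass an EEG batch of shape (batch, 32, time) and a video batch of shape (batch, 3, frames, height, width) through FusionNet and optimize self_distillation_loss; detaching the main-head logits in the KL term lets the shallow head learn from the fused branch without degrading it, which is one common way self-distillation is set up.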
The remainder of this paper is structured as follows: Section II discusses related works in the field. Section III introduces three key network models: PMNet, VMNet, and HCSFNet. Section IV details the experimental setup, the results obtained, and comparisons with other state-of-the-art methods. In Section V, the key findings and contributions are summarized.

II. RELATED WORK

A. Emotion Recognition Classification

Emotions are abstract expressions, and the prerequisite for employing a network to recognize emotions is to transform them into quantifiable indicators. There are currently two primary models for this transformation: discrete and dimensional [31]. The former classifies emotions into categories such as surprise, fear, sadness, joy, disgust, and anger, while the latter characterizes emotions along the Valence-Arousal (V-A) dimensions. These two representation methods are illustrated in Fig. 1, which shows that, in comparison with the discrete depiction of the emotion wheel, the V-A dimensional model provides a continuous representation of emotions. This continuous representation allows more nuanced emotional states to be distinguished, aligning with the intricacies of human emotions and assisting researchers in quantitatively analyzing emotions.

Fig. 1. Two emotion representation methods: the discrete emotion wheel (with categories such as fear, fury, and worry) and the Valence-Arousal dimensional model.
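As a concrete illustration of how the dimensional model is used in practice, datasets such as DEAP collect self-assessed valence and arousal ratings on a continuous 1-9 scale, and many studies binarize each rating at the midpoint to obtain high/low labels. The snippet below is a minimal sketch of this commonly used labeling step; the threshold of 5 and the quadrant names are conventional choices for illustration, not necessarily the exact protocol adopted in this paper.

```python
# Minimal sketch: turning continuous Valence-Arousal (V-A) self-ratings into labels.
# Assumes DEAP-style ratings on a 1-9 scale and the commonly used midpoint threshold
# of 5; the quadrant names are illustrative, not taken from the paper.

def binarize(rating: float, threshold: float = 5.0) -> int:
    """Map a 1-9 rating to a binary high (1) / low (0) label."""
    return int(rating > threshold)

def va_quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) pair to one of four coarse emotion quadrants."""
    quadrants = {
        (1, 1): "high-valence / high-arousal (e.g., joy, excitement)",
        (0, 1): "low-valence / high-arousal (e.g., anger, fear)",
        (0, 0): "low-valence / low-arousal (e.g., sadness, boredom)",
        (1, 0): "high-valence / low-arousal (e.g., calm, contentment)",
    }
    return quadrants[(binarize(valence), binarize(arousal))]

if __name__ == "__main__":
    print(binarize(7.2), binarize(3.1))   # 1 0
    print(va_quadrant(7.2, 6.5))          # high-valence / high-arousal (e.g., joy, excitement)
```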