

A Hierarchical Cross-modal Spatial Fusion Network for Multimodal Emotion Recognition

Ming Xu, Tuo Shi, Hao Zhang, Zeyi Liu, Xiao He, Senior Member, IEEE

IEEE Transactions on Artificial Intelligence, accepted author version. DOI 10.1109/TAI.2024.3523250.

This work was supported by Beijing Natural Science Foundation under grant L241016 and National Natural Science Foundation of China under grants 62473223 and 62163012. (Corresponding author: Xiao He.) Ming Xu, Zeyi Liu and Xiao He are with the Department of Automation, Tsinghua University, Beijing 100084, P. R. China (emails: [email protected], [email protected], [email protected]). Tuo Shi and Hao Zhang are with Shenzhen ZNV Technology Co., Ltd, Shenzhen 518063, P. R. China (e-mails: [email protected], [email protected]).

Abstract—Recent advancements in emotion recognition research based on physiological data have been notable. However, existing multimodal methods often overlook the interrelations between various modalities, such as video and Electroencephalography data, in emotion recognition. In this paper, a feature fusion-based hierarchical cross-modal spatial fusion network is proposed that effectively integrates EEG and video features. By designing an Electroencephalography feature extraction network based on 1D convolution and a video feature extraction network based on 3D convolution, corresponding modality features are thoroughly extracted. To promote sufficient interaction between the two modalities, a hierarchical cross-modal coordinated attention module is proposed in this paper. Additionally, to enhance the network's perceptual ability for emotion-related features, a multi-scale spatial pyramid pooling module is also designed. Meanwhile, a self-distillation method is introduced, which enhances the performance while reducing the number of parameters in the network. The hierarchical cross-modal spatial fusion network achieved an accuracy of 97.78% on the valence-arousal dimension of the DEAP dataset, and it also obtained an accuracy of 60.59% on the MAHNOB-HCI dataset, reaching the state-of-the-art level.

Impact Statement—Emotion recognition is a multidisciplinary field of research that spans psychology, neuroscience, and artificial intelligence. It has significant implications for deepening our understanding of human emotions, improving human-computer interaction, and advancing mental health treatment. Our research on the Hierarchical Cross-Modal Spatial Fusion Network (HCSFNet) goes beyond traditional studies focusing solely on physiological data. By exploring the complex interactions between video and EEG data, HCSFNet optimizes emotion recognition performance and introduces a universally applicable approach. Its applications range from improving emotional state recognition in mental health patients to promoting intelligent driver assistance services and enhancing human-computer interaction in robotics and smart homes. This research represents a major advance with broad implications for academic understanding and practical applications.

Index Terms—Emotion recognition, multimodal, attention, self-distillation, cross-modal, DEAP

I. INTRODUCTION

EMOTIONS play an indispensable role in everyday life, significantly influencing individual psychological states, social activities, and decision-making processes [1], [2]. By analyzing various information sources such as facial expressions, voice tones, and textual content, artificial intelligence enables the inference and recognition of individual emotional states [3], [4]. The application range of emotion recognition is broad, encompassing areas such as safety control, healthcare, education, and child growth monitoring [5]–[7]. Electroencephalography (EEG) data and video data are commonly used sources for emotion recognition [8]. EEG data provide insights into emotional states by measuring brain electrical signals, while video data recognize emotions by analyzing facial expressions, eye movements, and head postures. Each data type has strengths and weaknesses: EEG data directly reflect internal feelings but can be susceptible to noise, individual differences, and equipment limitations [9]. On the other hand, video data are more accessible and easier to process but can be affected by factors such as environmental lighting, facial obstructions, and feigned expressions [10], [11]. Therefore, effectively leveraging these two data types to enhance emotion recognition accuracy is a significant research focus.

In recent years, deep learning has emerged as the predominant research approach for emotion recognition [12]–[15]. Ozdemir et al. employed convolutional neural networks (CNNs) with multi-spectral topographic images to classify EEG signals, achieving an average accuracy of 88.38% on the DEAP dataset [16]. Yin et al. proposed a fusion network combining generative adversarial networks (GANs) with long short-term memory (LSTM) networks, achieving an average accuracy of 90.53% on the same dataset [17]. Rajpoot et al. introduced an LSTM network with channel attention to capture intrinsic variables in EEG data; they then extracted features using an attention-based CNN, resulting in an accuracy of 76.7% on the SEED dataset [18]. Zhang et al. employed a CNN-LSTM network to extract spatial features that represent functional relationships between EEG signals from different electrodes, achieving an accuracy of 94.17% on the DEAP dataset [3]. Guo et al. introduced a DCoT network that combines depth-wise convolution and Transformer architecture, achieving an average accuracy of 93.83% on the SEED dataset [19].

As research progresses, unimodal methods in emotion recognition face challenges and limitations [20]. Emotion recognition based on physiological information may be hindered by the complexity and noise sensitivity of physiological signals, while image-based recognition might miss subtle facial expression changes. Consequently, researchers are actively exploring multimodal methods, investigating how to combine various data sources for more comprehensive and accurate emotion recognition [21]–[23]. Wang et al. used facial images, audio signals, and EEG for emotion recognition, achieving a 71.24% accuracy on the SEED dataset [24]. Fu et al. proposed an MFFNN that fuses eye-tracking and EEG signals, achieving 87.32% accuracy on the SEED dataset [25]. Wu et al. developed a Bi-LSTM network for extracting and fusing video and EEG features, reaching a 95.12% average accuracy on the DEAP dataset [26]. Lopez et al. proposed a hypercomplex multimodal network comprising parameterized hypercomplex multiplications, effectively combining EEG and peripheral physiological signals [27]. However, the multimodal data sources in these works are all physiological signals, and eye movement signals may not sufficiently express facial emotional states. Wang et al. proposed an end-to-end multimodal transformer framework that uses cross-modal fusion to integrate data from different modalities, prioritizing contextual information through self-attention mechanisms; however, the framework may encounter challenges in customization and adaptation for specific application scenarios [28]. A dual-stream heterogeneous graph recurrent neural network for the fusion of multimodal physiological signals was also proposed for emotion recognition [29], but its computational costs and model complexity could be substantial.


In this paper, a novel feature fusion-based network called the hierarchical cross-modal spatial fusion network (HCSFNet) is proposed, which consists of two key components: the Physiological Model Network (PMNet), a 1D convolution-based EEG feature extraction network, and the Video Model Network (VMNet), a 3D convolution-based video feature extraction network. To underscore the significance of pivotal features in multimodal fusion and to enhance the capture of intermodal correlations, we propose a hierarchical cross-modal coordinated attention (HCCA) module. Additionally, we incorporate the multi-scale spatial pyramid pooling (MSPP) module to effectively capture the abundant information in multimodal fusion features [30]. By employing HCSFNet, we achieve an efficient fusion of EEG and video features, enabling seamless intermodal information exchange and attaining state-of-the-art performance on the DEAP and MAHNOB-HCI datasets. The primary contributions of this paper can be summarized as follows:

1) We introduce PMNet, an EEG feature extraction network using 1D convolution and attention mechanisms, and VMNet, a video feature network employing 3D convolution for comprehensive spatio-temporal feature extraction.
2) An MSPP module is designed to effectively capture critical information from EEG and video modalities across spatial and temporal scales. This module enhances the network's ability to integrate the fused features more effectively.
3) The HCSFNet is proposed through the integration of PMNet and VMNet, combined with the dynamic attention interaction capabilities of the HCCA module and the multi-scale information fusion capabilities of the MSPP module.
4) To the best of our knowledge, this is the first study to introduce self-distillation to multimodal emotion recognition based on EEG and video data, so that HCSFNet requires fewer computational resources and achieves better accuracy.

The remainder of this paper is structured as follows: Section II discusses related works in the field. Section III introduces three key network models: PMNet, VMNet, and HCSFNet. Section IV details the experimental setup, the results obtained, and comparisons with other state-of-the-art methods. In Section V, the key findings and contributions are concluded.

II. RELATED WORK

A. Emotion Recognition Classification

Emotions are abstract expressions, and the prerequisite for employing a network to recognize emotions is to transform them into quantifiable indicators. There are currently two primary models for this transformation: discrete and dimensional [31]. The former classifies emotions into categories such as surprise, fear, sadness, joy, disgust, and anger, while the latter characterizes emotions based on Valence-Arousal (V-A). These two emotional representation methods are illustrated in Fig. 1, which demonstrates that, in comparison to the discrete depiction of the emotion wheel, the V-A dimensional model provides a continuous representation of emotions. This detailed representation allows for the distinction of more nuanced emotional states, aligning with the intricacies of human emotions and assisting researchers in quantitatively analyzing emotions.

Fig. 1. Illustrations of two methods of emotional representation. (a) The Emotion Wheel, showing the relationships and intensities of different emotional states, where a position closer to the center of the wheel indicates lower intensity, and vice versa for higher intensity. (b) The V-A dimensional model, where the range of V-A extends from unhappy/sleepy to happy/excited.
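In practice, dimensional annotations on the V-A plane are typically reduced to discrete high/low targets per dimension before a classifier is trained. The snippet below is only a minimal illustration of that convention; the 1–9 rating scale and the threshold of 5 follow the common DEAP labeling protocol and are assumptions rather than details taken from this paper.

```python
import numpy as np

def binarize_va(ratings: np.ndarray, threshold: float = 5.0) -> np.ndarray:
    """Map continuous valence/arousal ratings (assumed 1-9 scale, as in DEAP)
    to binary high/low labels for dimensional emotion classification."""
    return (ratings >= threshold).astype(np.float32)

# Example: one trial rated valence=6.2, arousal=3.1 -> high valence, low arousal.
print(binarize_va(np.array([6.2, 3.1])))  # [1. 0.]
```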


B. Attention-based PMB Module

EEG data, characterized by a one-dimensional signal carrying temporal information [32], benefit from the attributes of 1D convolution, which align well with their temporal structure. This property enables the precise capture of critical time points or intervals within the signal [33]. Additionally, 1D convolution offers advantages in terms of reducing unnecessary parameters, reducing model complexity, and highlighting temporal relationships. Lawhern et al. introduced EEGNet, the first network to employ 2D convolutional networks for EEG signal recognition tasks [34]. Kuang et al. further enhanced EEGNet by introducing the KAM attention module, which led to more accurate emotion classification [35]. The MCAM is an efficient attention module [36] that dynamically incorporates an attention matrix.

This paper considers the 1D temporal characteristics of EEG, and 1D convolution is employed to construct the EEG-based feature extraction network for emotion recognition, given the attention module's ability to discriminate and capture key features within EEG signals, thereby enhancing the network's representational capacity and robustness in complex tasks with noise interference [37]–[39]. The SE attention module [40] is utilized in this paper for EEG data analysis. Specifically, its 2D convolution is replaced with 1D convolution, and the resulting module is named SE-1D.

The paper introduces the Physiological Model Block (PMB) structure to fully extract features from EEG signals, as depicted in Fig. 2. This structure divides the input data into two essential components based on channel-wise characteristics. One component is processed through two CBPR modules, as illustrated in Fig. 2, using SE-1D for fundamental feature extraction from the raw data. The other component undergoes processing through a single CBPR module. In this process, by controlling different gradient flows, the structure effectively captures key information within the data and prevents information loss.

Recognizing that simple concatenation might result in insufficient information exchange within subsequent networks, the paper employs a channel shuffle operation, followed by feature re-aggregation through an additional CBPR module for further feature extraction and enhancement. This design allows the network to more accurately capture abstract features within EEG data, thereby improving its overall representational capability.

Fig. 2. Components and architecture of key modules in the network. (a) The structure of the PMB module. (b) The CBPR module, composed of Convolution (Conv), Batch Normalization (BN), and Parametric Rectified Linear Unit (PReLU). B denotes batch size, C denotes channel, and L denotes length.

C. STSC Module Based on 3D Convolution

Recent studies have increasingly utilized facial expressions in the field of emotion recognition. Ma et al. introduced a visual transformer algorithm that relies on feature fusion, combining LBP and CNN features to enhance expression recognition accuracy [41]. Zhao et al. developed a multi-scale transformer network to learn micro-expression features [42]. Additionally, Song et al. innovatively incorporated attribute information into micro-expression recognition while optimizing the network using cross-modal comparison learning [43].

Given that 3D convolution has demonstrated numerous advantages in processing video data [44], such as the ability to model both time and space simultaneously and the capacity to capture spatial context [45], designing a network based on 3D convolution for a multimodal emotion recognition task offers benefits. This approach enables efficient processing of video data, facilitates the extraction of critical spatio-temporal features, and provides a comprehensive understanding of the dynamic representation of video data and the neural activity associated with the temporal sequence. To enhance the efficiency of the 3D convolution operation, the STSC module is introduced, as depicted in Fig. 3.

In contrast to traditional 3D convolution kernels (3, 3, 3), the STSC utilizes two sets of smaller kernels, namely (3, 3, 1) and (1, 3, 3). This approach fully capitalizes on the separable nature of spatio-temporal features in video data. The advantages of this technique include a reduction in the network's parameters and computational costs, as well as the prevention of overfitting in deep networks. Consequently, it reduces computational redundancy and enhances efficiency.

Fig. 3. STSC schematic. A spatial kernel of 1×3×3 (stride 1, padding (0,1,1)) is followed by a temporal kernel of 3×1×1 (stride 1, padding (1,0,0)), mapping the input feature through a mid feature to the output feature.

D. Multi-scale Space Pyramid Pool: MSPP

In the task of emotion classification using EEG and video data, considering the temporal and multi-frequency characteristics of EEG data, as well as the complexity of the fusion features, the Multi-Scale Pooling Pyramid (MSPP), illustrated in Fig. 7, is introduced. MSPP incorporates multiple consecutive pooling layers, effectively capturing critical information across different frequency ranges and temporal scales within the EEG data. This enhances the network's capacity to combine features fused from both modalities, thereby improving its sensitivity to emotion-related features. Furthermore, by merging features from various receptive fields, MSPP offers advantages in adapting to individual differences, comprehensively capturing temporal patterns, and integrating complementary information. These properties enhance the discriminative power and generalization performance of the network.

Fig. 7. MSPP Module. The MSPP module follows a specific sequence of operations. Initially, the input goes through a CBPR module, followed by a series of Max Pooling layers to acquire features at various scales. Subsequently, these feature maps are concatenated along the channel dimension, resulting in an output feature map that consolidates information from multiple receptive fields. Finally, this combined feature map undergoes another round of processing through a CBPR module for additional feature extraction. In the context of these operations, B represents the batch size, C the number of channels, L the length, k the kernel size, and s the stride.
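To make the structure of Fig. 7 concrete, the following is a minimal PyTorch sketch of the MSPP module. The three cascaded max-pooling layers with kernel 3, stride 1, and padding 1 follow the figure; treating the pooling as 1D over the fused feature length is an assumption of this illustration.

```python
import torch
import torch.nn as nn

def cbpr(in_ch, out_ch, k=1):
    """Conv1d -> BN -> PReLU helper (the CBPR unit of Fig. 2b)."""
    return nn.Sequential(nn.Conv1d(in_ch, out_ch, k, padding=k // 2),
                         nn.BatchNorm1d(out_ch), nn.PReLU())

class MSPP(nn.Module):
    """Multi-Scale Pooling Pyramid (Fig. 7): a 1x1 CBPR, three cascaded
    max-pooling layers (k=3, s=1, p=1), channel-wise concatenation of all four
    maps, and a final 1x1 CBPR back to the original width."""
    def __init__(self, ch):
        super().__init__()
        self.reduce = cbpr(ch, ch // 2)
        self.pool = nn.MaxPool1d(kernel_size=3, stride=1, padding=1)
        self.fuse = cbpr(2 * ch, ch)

    def forward(self, x):
        y = self.reduce(x)        # B x C/2 x L
        p1 = self.pool(y)         # progressively larger receptive fields
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return self.fuse(torch.cat([y, p1, p2, p3], dim=1))  # B x 2C x L -> B x C x L

# Example: fused multimodal features with 64 channels and length 32.
print(MSPP(64)(torch.randn(2, 64, 32)).shape)  # torch.Size([2, 64, 32])
```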

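The CBPR unit and the PMB block of Section II-B (Fig. 2) can be sketched in the same spirit. The SE-1D reduction ratio, the branch kernel sizes, and the omission of strided down-sampling are assumptions of this illustration rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CBPR(nn.Module):
    """Conv1d -> BatchNorm1d -> PReLU, as in Fig. 2(b)."""
    def __init__(self, in_ch, out_ch, k=3, s=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, k, stride=s, padding=k // 2),
            nn.BatchNorm1d(out_ch),
            nn.PReLU())

    def forward(self, x):
        return self.block(x)

class SE1D(nn.Module):
    """SE attention realized with 1D convolutions (reduction ratio r is assumed)."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Sequential(nn.Conv1d(ch, ch // r, 1), nn.PReLU(),
                                nn.Conv1d(ch // r, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(self.pool(x))

def channel_shuffle(x, groups=2):
    b, c, l = x.shape
    return x.view(b, groups, c // groups, l).transpose(1, 2).reshape(b, c, l)

class PMB(nn.Module):
    """Physiological Model Block (Fig. 2a): channel split, two asymmetric
    branches, concatenation, channel shuffle, and a final CBPR for re-aggregation."""
    def __init__(self, ch):
        super().__init__()
        half = ch // 2
        self.branch_a = nn.Sequential(CBPR(half, half), CBPR(half, half), SE1D(half))
        self.branch_b = CBPR(half, half)
        self.fuse = CBPR(ch, ch)

    def forward(self, x):
        a, b = torch.chunk(x, 2, dim=1)
        y = torch.cat([self.branch_a(a), self.branch_b(b)], dim=1)
        return self.fuse(channel_shuffle(y))

# Example: a batch of 64-channel feature maps of length 25 (PMNet intermediate shape).
print(PMB(64)(torch.randn(2, 64, 25)).shape)  # torch.Size([2, 64, 25])
```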

E. Cross-Modal Fusion Attention

A Layered Cross Attention (LCA) module is designed, as shown in Fig. 4. The Tanh activation function enhances nonlinear expression, and the iterative looping mechanism dynamically adjusts feature weights, thereby improving the accuracy of key feature recognition. Utilizing a layered processing strategy, we have further developed the Hierarchical Cross-Modal Attention (HCMA) based on LCA, as shown in Fig. 5, which employs channel split to mitigate interference between modalities. Its hierarchical structure facilitates the integration of multi-scale features, while feature reorganization and Channel Shuffle operations enhance cross-modal feature interaction.

Fig. 4. LCA Module. LCA consists of a layered attention framework with the Tanh function and the iterative mechanism.

Fig. 5. HCMA Module. HCMA is developed from LCA and integrates the layered strategy with Channel Split.

A Hierarchical Cross-Modal Coordinated Attention (HCCA) is proposed based on the HCMA, as shown in Fig. 6. By integrating the Squeeze-and-Excitation (SE) attention mechanism, the recognition capabilities of multimodal features are further enhanced. HCCA refines the channel split from HCMA to minimize interference between modalities and achieves cross-level feature integration through its layered structure. A Linear layer is utilized for dimensionality reduction of the video features, followed by the reinforcement of key channel features via the SE mechanism, heightening the model's sensitivity to task-relevant features.

Fig. 6. HCCA Module. HCCA is developed from HCMA and integrates the SE mechanism.

F. Self-Distillation

A self-distillation approach is introduced based on Channel-wise Knowledge Distillation (CWD) [46]. Channel-level knowledge distillation focuses on the activation maps of each channel, converting them into probability distributions and minimizing the difference between these probability distributions for the teacher and student models. The variable y denotes the output features of the network, c = 1, 2, ..., C indexes the channel, and i indexes the spatial location of a channel. The teacher and student are denoted as T and S, with activation mappings $y^T$ from T and $y^S$ from S. $\phi(\cdot)$ is used to convert the activation values into probability distributions, where $\tau$ is a hyperparameter known as the temperature. The details are shown in Eq. (1).

$\phi(y_c) = \dfrac{\exp\left(y_{c,i}/\tau\right)}{\sum_{i=1}^{W \cdot H} \exp\left(y_{c,i}/\tau\right)}$   (1)

KL divergence is employed to evaluate the distribution discrepancy between the teacher and student features, as shown in Eq. (2).

$\varphi\left(y^T, y^S\right) = \dfrac{\tau^2}{C} \sum_{c=1}^{C} \sum_{i=1}^{W \cdot H} \phi\left(y_{c,i}^T\right) \log \dfrac{\phi\left(y_{c,i}^T\right)}{\phi\left(y_{c,i}^S\right)}$   (2)

The loss function, shown in Eq. (3), consists of two components: one is the cross-entropy loss based on the true labels, denoted as $L_{cls}$, and the other is the distillation loss derived from the features of the teacher model, denoted as $L_{KD}$. The weighted sum of these two components constitutes the total loss, which is optimized by the student model [47].

$L_{total} = L_{cls} + \alpha L_{KD}$   (3)

To balance the two losses, a hyperparameter $\alpha$ is introduced. In the early stages of training, the soft labels from the teacher model are more readily learnable, and the setting of $\alpha$ enables the student model to assimilate knowledge quickly. As training progresses and the performance of the student model increasingly approximates that of the teacher model, the hard labels will aid in more nuanced learning for the student. To control the impact of these two types of losses more precisely, we utilize a cosine annealing schedule to dynamically adjust $\alpha$:

$\alpha = -0.99 \left( \dfrac{1 - \cos\left(\pi \cdot \frac{E_i}{E_{max}}\right)}{2} \right) + 1$   (4)

where $E_i$ represents the current training epoch and $E_{max}$ represents the maximum number of training epochs.
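A compact PyTorch rendering of Eqs. (1)–(4) is given below as a sketch. The temperature value and the use of BCE for the classification term (consistent with the sigmoid heads and the BCE loss listed in Table III) are assumptions about details the text leaves open.

```python
import math
import torch
import torch.nn.functional as F

def cwd_loss(y_t: torch.Tensor, y_s: torch.Tensor, tau: float = 4.0) -> torch.Tensor:
    """Channel-wise distillation loss of Eqs. (1)-(2): each channel's activation
    map is turned into a probability distribution over spatial positions with a
    temperature-scaled softmax, and teacher/student distributions are compared
    with KL divergence. Shapes: (B, C, N), with N the flattened spatial size."""
    b, c, n = y_t.shape
    p_t = F.softmax(y_t / tau, dim=-1)                              # phi(y^T), Eq. (1)
    log_p_s = F.log_softmax(y_s / tau, dim=-1)                      # log phi(y^S)
    kl = (p_t * (p_t.clamp_min(1e-8).log() - log_p_s)).sum(-1)      # KL per channel
    return (tau ** 2) * kl.sum() / (b * c)                          # tau^2 / C, averaged over the batch

def alpha_schedule(epoch: int, max_epoch: int) -> float:
    """Cosine-annealed distillation weight of Eq. (4): decays from 1 toward 0.01."""
    return -0.99 * (1.0 - math.cos(math.pi * epoch / max_epoch)) / 2.0 + 1.0

def total_loss(logits, targets, feat_s, feat_t, epoch, max_epoch):
    """Eq. (3): classification loss (BCE assumed here) plus the alpha-weighted
    distillation term computed on teacher/student feature maps."""
    l_cls = F.binary_cross_entropy_with_logits(logits, targets)
    l_kd = cwd_loss(feat_t.detach(), feat_s)
    return l_cls + alpha_schedule(epoch, max_epoch) * l_kd
```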


III. NETWORK MODEL

A. Physiological Model Network: PMNet

To leverage EEG data for emotion recognition, this paper introduces the PMNet architecture, as outlined in Table I. The input data is B × 8 × 50, where B represents the batch size, 8 denotes the number of channels, and 50 corresponds to the feature size. The process begins with the extraction of physiological features through a CBPR module, utilizing a convolutional kernel size of 7 and a stride of 2. Subsequently, a Max Pooling layer with a kernel size of 2 and a stride of 2 is applied for down-sampling by a factor of two. Next, the PMB module is employed for feature extraction and further down-sampling. Two additional PMB modules are used for extracting feature information, with the first having a stride of 2. Finally, an Average Pooling layer is applied, followed by flattening, and the output is passed through FC layers and a Sigmoid layer to produce the final output of shape B × NC, where NC represents the number of emotion classes.

TABLE I
THE STRUCTURE OF PMNET

Layer   Output        Options
Input   B × 8 × 50    –
L-1     B × 64 × 25   CBPR, k = 7, s = 2
L-2     B × 64 × 12   Max Pooling1d, k = 2, s = 2
L-3     B × 64 × 6    PMB, Max Pooling1d, k = 2, s = 2
L-4     B × 64 × 3    PMB, PMB, s = 2
Head    B × NC        Avg Pooling1d, FC, sigmoid

Notes: Layer indicates the name of each layer; Output refers to the shape of the output feature map for each layer, where B represents the batch size and NC is the number of classes; Options details the configuration of each layer, with k and s denoting kernel size and stride respectively, followed by integers specifying their values; FC stands for fully connected layer; and sigmoid denotes the sigmoid activation function.

B. Video Model Network: VMNet

The overall network structure is described in Table II. The input data initially undergoes processing through a standard 3D convolutional layer with a kernel size of (3, 7, 7) and a stride of (1, 2, 2), followed by a 3D max pooling layer with a kernel size of (3, 3, 3) and a stride of (2, 2, 2). These operations are performed for feature extraction and dimension reduction. Subsequently, the data is fed through three STSC modules, as depicted in Fig. 8, to further extract rich spatio-temporal features. Notably, the first convolutional layer of the second STSC module employs a stride of 2. Finally, a 3D average pooling layer is applied, followed by flattening, and the output is passed through FC layers and a Sigmoid layer to produce the final output of shape B × NC, where NC represents the number of emotion classes.

Fig. 8. STSC Module. The input data is first split along the channel dimension into two parts. One part uses two STSCs to extract features from the original image, and the other part uses one STSC for feature extraction. Subsequently, the features from both parts are concatenated along the channel dimension. Then, another STSC module is used to further extract and enhance the fused features, resulting in the final output. B denotes batch size, C denotes channel, T denotes depth, H denotes height, and W denotes width.

TABLE II
THE STRUCTURE OF VMNET

Layer   Output                   Options
Input   B × 3 × 16 × 114 × 114   –
L-1     B × 32 × 16 × 57 × 57    Conv3d, k(3, 7, 7), s(1, 2, 2)
L-2     B × 32 × 8 × 28 × 28     Max Pooling3d, k = 3, s = 2
L-3     B × 32 × 8 × 28 × 28     STSC, k = 3, s = 1
L-4     B × 32 × 4 × 14 × 14     STSC, k = 3, s = 2
L-5     B × 32 × 4 × 14 × 14     STSC, k = 3, s = 1
Head    B × NC                   Avg Pooling3d, FC, sigmoid

Notes: k and s denote the kernel size and stride, with the subsequent parentheses indicating the corresponding sizes for the temporal, height, and width dimensions; a single integer indicates that the sizes of these three dimensions are identical. FC stands for fully connected layer, and sigmoid denotes the sigmoid activation function.
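For illustration, a minimal PyTorch sketch of the factorized convolution of Fig. 3 and the STSC module of Fig. 8 follows. The text quotes the kernel pair as (3, 3, 1) and (1, 3, 3) while Fig. 3 shows a 1×3×3 spatial kernel followed by a 3×1×1 temporal kernel; the sketch follows the figure, and the normalization/activation placement is an assumption.

```python
import torch
import torch.nn as nn

class STSCConv(nn.Module):
    """Factorized spatio-temporal convolution (Fig. 3): a 1x3x3 spatial kernel
    followed by a 3x1x1 temporal kernel instead of a full 3x3x3 kernel."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, 3, 3),
                                 stride=(1, stride, stride), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(out_ch, out_ch, (3, 1, 1),
                                  stride=(stride, 1, 1), padding=(1, 0, 0))
        self.bn = nn.BatchNorm3d(out_ch)
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(self.bn(self.temporal(self.spatial(x))))

class STSCBlock(nn.Module):
    """STSC module of Fig. 8: channel split, an asymmetric two-branch path
    (two factorized convs vs. one), concatenation, and a final factorized conv."""
    def __init__(self, ch):
        super().__init__()
        half = ch // 2
        self.branch_a = nn.Sequential(STSCConv(half, half), STSCConv(half, half))
        self.branch_b = STSCConv(half, half)
        self.fuse = STSCConv(ch, ch)

    def forward(self, x):
        a, b = torch.chunk(x, 2, dim=1)
        return self.fuse(torch.cat([self.branch_a(a), self.branch_b(b)], dim=1))

# Example: a clip tensor shaped (B, C, T, H, W) as in Table II after layer L-2.
print(STSCBlock(32)(torch.randn(1, 32, 8, 28, 28)).shape)  # torch.Size([1, 32, 8, 28, 28])
```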


C. Hierarchical Cross-Modal Spatial Fusion Network: HCSFNet

Multimodal networks offer several advantages, including facilitating the exchange of complementary information, avoiding redundant or noisy data, and achieving more effective feature fusion. They also provide insights into the importance and contribution of each modality in the target task, thereby enhancing the interpretability of the fusion process.

Fig. 9. HCSFNet Structure. The pipeline proceeds from the multi-modal input, through parallel feature extraction (PMNet and VMNet) and the HCCA module, to emotion classification (MSPP and ECH) and the emotion results. "Concat" refers to the concatenation operation along the channel dimension; ⊕ represents element-wise addition at corresponding positions; "Sigmoid" denotes the Sigmoid activation function; "FC" stands for Fully Connected Layer.

Weighted concatenation fusion, a fundamental method for multimodal feature fusion, can merge input features in a relevant manner, thereby improving the utilization of effective feature information [48]–[50]. However, this approach has its limitations. Firstly, channel concatenation and element-wise addition are straightforward fusion methods that can lead to redundancy between different modalities [51]. Secondly, this method does not consider dynamic interactions and attention allocation between modalities, potentially overlooking their correlations and complementarity. Lastly, due to different sampling rates among modalities, multimodal data often cannot be perfectly aligned. In light of these challenges, this paper introduces HCSFNet, as depicted in Fig. 9, proposing HCCA to address the shortcomings of weighted concatenation methods.

Considering the redundancy in 3D convolutional networks, we introduced a self-distillation method based on VMNet, scaling down the width of VMNet to 75%, 50%, and 30% of the original model size. Pre-trained teacher models were obtained by pre-training VMNet in unimodal settings at these reduced sizes. During the training phase of HCSFNet, we employed self-distillation, where the well-trained VMNet teacher models directly distilled knowledge to the student models before the multimodal features were input to HCCA.

In Fig. 9, physiological data $X_p$ and video data $X_v$ are processed through the PMNet and VMNet networks, respectively, without utilizing the Head layer. This results in the extraction of EEG features $X_p' \in \mathbb{R}^{B \times C \times L_p}$ and video features $X_v' \in \mathbb{R}^{B \times C \times L_v}$. Subsequently, both sets of features enter the HCCA module. The Linear layer, as described by Eq. (5), is initially employed to reduce the dimensionality of the video features $X_v'$, producing the output $X_{vf}$. Here, $\phi$ represents the FC operation, and $W \in \mathbb{R}^C$, whose dimension is consistent with the channel number C.

$X_{vf} = \phi\left(W X_v'\right)$   (5)

The input $X_{vf}$ is then passed through the SE attention module to obtain video features S enhanced with channel attention, as described in Eq. (6). During this process, $\eta$ denotes a 1D average pooling layer, $\delta$ represents the PReLU activation function, and $\sigma$ signifies the sigmoid function. Additionally, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, and $\otimes$ denotes the broadcast multiplication operation between tensors and vectors.

$S = \sigma\left(g\left(X_{vf}, W\right)\right) = \sigma\left(W_2\,\delta\left(W_1\,\eta\left(X_{vf}\right)\right)\right) \otimes X_{vf}$   (6)

Finally, the video features S are concatenated with the physiological features $X_p'$ to produce the output $X_c$, as described in Eq. (7), where $CS(\cdot)$ denotes the Channel Shuffle operation.

$X_c = CS\left(\mathrm{Concat}\left(S, X_p'\right)\right)$   (7)

In the HCCA attention module, the physiological features $X_p'$ and video features $X_v'$ are initially split along the channel dimension, resulting in outputs $X_{p1}$, $X_{p2}$, $X_{v1}$, and $X_{v2}$. These outputs are then processed through the LCA module, producing $X_{a1}$ and $X_{a2}$. The final output $X_a$ is computed using Eq. (8), which involves the channel shuffle operation. Finally, the HCCA output features are obtained by summing the two parts of the features through Eq. (9).

$X_a = CS\left(\mathrm{Concat}\left(X_{a1}, X_{a2}\right)\right)$   (8)

$X_H = X_a + X_c$   (9)

Returning to the main network, once the cross-modal attention fusion features are obtained, the MSPP module is utilized to extract multi-scale features, enabling the capture of contextual information at various scales. Finally, the ensemble convolution with head (ECH) module is employed to extract multimodal features following the cross-modal attention fusion. ECH consists of convolutional blocks and FC layers for additional feature extraction from the features obtained after MSPP, yielding the final classification results. This process is described in Eq. (10), where $\Xi(\cdot)$ represents the MSPP module, $\Phi(\cdot)$ represents the ECH module, and $\sigma(\cdot)$ denotes the sigmoid activation function.

$X = \sigma\left(\Phi\left(\Xi\left(X_H\right)\right)\right)$   (10)

With the introduction of the HCCA module, HCSFNet not only facilitates the learning of high-level features complementarily with low-level features to enhance the semantic information of the latter, but also leverages the correlation and complementarity between different modalities through dynamic interaction and attention allocation. This design enhances the network's performance, particularly in scenarios involving heterogeneous data sources or significant disparities between modalities. It allows for a more precise capture of intermodal relationships, ultimately improving overall system performance.
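The fusion path of Eqs. (5)–(9) can be sketched in PyTorch as follows. The LCA internals are deliberately simplified to a Tanh/Sigmoid gate (the iterative refinement of Fig. 4 is omitted), the video length is aligned to the physiological length with the Linear layer of Eq. (5), and the 1×1 convolution that maps the concatenated 2C channels back to C is an assumption; the ECH head of Eq. (10) is not included.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    b, c, l = x.shape
    return x.view(b, groups, c // groups, l).transpose(1, 2).reshape(b, c, l)

class LCA(nn.Module):
    """Simplified stand-in for the Layered Cross Attention of Fig. 4: a
    Tanh/Sigmoid gate derived from one modality modulates the other."""
    def __init__(self, ch):
        super().__init__()
        self.proj = nn.Conv1d(ch, ch, 1)

    def forward(self, p, v):
        return torch.sigmoid(torch.tanh(self.proj(p))) * v

class HCCA(nn.Module):
    """Sketch of the HCCA fusion path of Eqs. (5)-(9) and Fig. 9."""
    def __init__(self, ch, len_v, len_p, r=4):
        super().__init__()
        self.align = nn.Linear(len_v, len_p)                       # Eq. (5): length alignment
        self.se = nn.Sequential(nn.AdaptiveAvgPool1d(1),           # Eq. (6): SE channel gate
                                nn.Conv1d(ch, ch // r, 1), nn.PReLU(),
                                nn.Conv1d(ch // r, ch, 1), nn.Sigmoid())
        self.mix_c = nn.Conv1d(2 * ch, ch, 1)                      # assumed 2C -> C mapping
        self.lca1, self.lca2 = LCA(ch // 2), LCA(ch // 2)

    def forward(self, xp, xv):
        xvf = self.align(xv)                                       # B x C x Lp
        s = self.se(xvf) * xvf                                     # Eq. (6)
        xc = self.mix_c(channel_shuffle(torch.cat([s, xp], 1)))    # Eq. (7)
        p1, p2 = torch.chunk(xp, 2, dim=1)
        v1, v2 = torch.chunk(xvf, 2, dim=1)
        xa = channel_shuffle(torch.cat([self.lca1(p1, v1),
                                        self.lca2(p2, v2)], 1))    # Eq. (8)
        return xa + xc                                             # Eq. (9): X_H

# Example: EEG features (B, 64, 3) and video features (B, 64, 8).
print(HCCA(64, len_v=8, len_p=3)(torch.randn(2, 64, 3), torch.randn(2, 64, 8)).shape)
# torch.Size([2, 64, 3])
```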


IV. EXPERIMENT

A. Experimental Setup

The experiments were conducted using the Python programming language and the PyTorch deep learning framework, starting from scratch. All training and validation processes were carried out using dual NVIDIA RTX 2080Ti GPUs. The specific configurations used in the experiments are provided in Table III.

TABLE III
EXPERIMENTAL SETTINGS

Item                   Configuration
Epoch                  50
Batch Size             128
Loss                   BCE
Scheduler              CosineAnnealingLR
Momentum               0.9
Weight Decay           0.0001
Test Dataset Ratio     0.2
Crop Size              [114, 114]
Learning Rate          0.00002
Weight Initialization  Kaiming
Concat Dim             Sample-wise

B. Data Processing

When applying HCSFNet to the DEAP dataset, preprocessing of both the EEG and video data is essential. The initial step involves preprocessing the EEG data, as depicted in Fig. 10.

Fig. 10. EEG Data Processing Flow. It includes the number of subjects ("Subjects") and data concatenation ("Concat"); α and β are variable parameters that change with the method of data concatenation.

As mentioned in the official documentation of the DEAP dataset, some samples lack corresponding video data, and data from four subjects are missing. Consequently, the EEG data of these subjects were excluded, retaining only the valence and arousal dimension data. Next, within each 63-second recording, the initial 3 seconds were considered preparation time, and the data from these 3 seconds were averaged to obtain an average signal. This average signal was then subtracted from the signal of the remaining 60 seconds. Finally, the valence and arousal dimensions were separately normalized and processed using a sliding window of size 128. The dimensions of the valence and arousal data obtained after processing are each 720 × 60 × 32 × 128. This representation corresponds to (trials × number of subjects) × video duration × number of channels × sampling points.

The processed valence-arousal (V-A) data are concatenated to obtain the complete EEG data, either along the channel or the sampling-point dimension. The effects of these different concatenation methods are discussed later. For video data, frame rates are down-sampled to 16 FPS, and frames are resized to 114 × 114 pixels and normalized.
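As a concrete reading of the EEG pipeline in Section IV-B, the following sketch performs the 3-second baseline subtraction, normalization, and windowing for a single DEAP trial. The z-score normalization and the non-overlapping window stride are assumptions (the latter is consistent with the reported 720 × 60 × 32 × 128 shape); DEAP signals are sampled at 128 Hz.

```python
import numpy as np

def preprocess_deap_trial(eeg: np.ndarray, fs: int = 128) -> np.ndarray:
    """Baseline-correct, normalize, and window one DEAP trial (Section IV-B).
    `eeg` has shape (channels, 63 * fs): 3 s of preparation followed by 60 s of
    stimulus. Returns (60, channels, 128): non-overlapping windows of 128 samples."""
    baseline = eeg[:, : 3 * fs].mean(axis=1, keepdims=True)    # average of the first 3 s
    signal = eeg[:, 3 * fs:] - baseline                        # subtract it from the 60 s signal
    # Per-channel normalization (z-scoring is an assumption; the paper only says "normalized").
    signal = (signal - signal.mean(axis=1, keepdims=True)) / (signal.std(axis=1, keepdims=True) + 1e-8)
    windows = signal.reshape(signal.shape[0], -1, 128)         # (channels, 60, 128)
    return windows.transpose(1, 0, 2)                          # (60, channels, 128)

# Example: one trial with 32 EEG channels.
print(preprocess_deap_trial(np.random.randn(32, 63 * 128)).shape)  # (60, 32, 128)
```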

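The training configuration of Table III (Section IV-A) can be mirrored with standard PyTorch utilities as below. The optimizer type is not stated in the paper, so SGD is assumed because momentum and weight decay are listed; the model here is only a placeholder, not the authors' HCSFNet.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 2), nn.Sigmoid())     # placeholder module

criterion = nn.BCELoss()                                   # "Loss: BCE" (sigmoid outputs)
optimizer = torch.optim.SGD(model.parameters(), lr=2e-5,   # "Learning Rate: 0.00002"
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for module in model.modules():                             # "Weight Initialization: Kaiming"
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight)

for epoch in range(50):                                    # "Epoch: 50"; batch size 128 in the loader
    # ... training and validation passes over the data loaders go here ...
    scheduler.step()
```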

C. Experimental Results

In order to verify the effectiveness of different cross-modal attentions, ablation experiments were conducted based on HCSFNet, as shown in Table V. As previously analyzed, the network achieved the highest accuracy under HCCA, reaching 95.82%. Therefore, HCCA is selected as the cross-modal attention of HCSFNet in subsequent experiments.

TABLE V
PERFORMANCE COMPARISON ON DIFFERENT COMPONENTS OF CROSS-MODAL ATTENTIONS (ACCURACY, %)

Cross-Modal Attentions   Mean    Valence   Arousal
Channel-wise + LCA       95.42   96.60     94.24
Channel-wise + HCMA      91.20   90.81     93.59
Channel-wise + HCCA      95.82   95.54     96.10
Sample-wise + LCA        81.05   77.56     84.55
Sample-wise + HCMA       96.80   97.04     96.56
Sample-wise + HCCA       97.54   97.42     97.66

As mentioned in the data processing section, the preprocessed EEG data are divided into two parts: Valence and Arousal. These two parts need to be concatenated before multimodal training, and there are two concatenation methods: Channel-wise and Sample-wise. This paper conducts experiments on both concatenation methods, and the results are illustrated in Fig. 11.

Fig. 11. Test curves for the two concatenation methods in HCSFNet (comparison of concatenation results in two dimensions: accuracy and loss versus epoch for Sample-wise and Channel-wise concatenation, with marked regions A1/A2 and B1/B2).

As shown in the accuracy curves, Sample-wise outperforms Channel-wise (97.54% vs. 95.82%), highlighting its superiority in enhancing accuracy. Regarding the loss curves, both methods initially show no significant difference, with a sharp decline in loss values. At area B2, both methods reach their lowest loss values. However, the Channel-wise minimum (0.2229) is significantly higher than the Sample-wise minimum (0.1822), reaffirming that Sample-wise concatenation aids in reducing loss and achieving better training outcomes.

To enhance the computational efficiency of the network, we performed proportional scaling on VMNet along the width dimension. According to the scaling ratio, the variants are denoted HCSFNet (VM1.0), HCSFNet (VM0.75), HCSFNet (VM0.5), and HCSFNet (VM0.3), corresponding to VMNet scaling ratios of 100%, 75%, 50%, and 30%, as shown in Table IV. HCSFNet (VM1.0) achieved FLOPs and MAC values of 800.50 M and 400.25 M, respectively. Subsequently, by compressing the network along the width dimension, we reduced the FLOPs of HCSFNet (VM0.5) to 267.81 M, and further to 135.03 M for HCSFNet (VM0.3). Correspondingly, the MAC values were reduced to 133.91 M and 67.52 M, respectively.

TABLE IV
COMPARISON OF PARAMETERS AND ACCURACY FOR HCSFNET

                    Accuracy without Self-Distillation (%)   Accuracy with Self-Distillation (%)
HCSFNet             Mean    Valence   Arousal                Mean    Valence   Arousal               FLOPs (M)   MAC (M)
HCSFNet (VM1.0)     97.54   97.42     97.66                  97.09   97.57     96.62                 800.50      400.25
HCSFNet (VM0.75)    97.23   96.96     97.50                  97.78   97.20     98.36                 493.59      246.79
HCSFNet (VM0.5)     97.07   98.03     96.11                  97.77   98.29     97.25                 267.81      133.91
HCSFNet (VM0.3)     96.51   96.97     96.05                  96.72   98.19     95.95                 135.03      67.52

Experiments on HCSFNet with self-distillation based on different VMNet sizes were conducted, and the results can be found in Table IV. The results indicate that HCSFNet (VM0.75) trained with self-distillation achieved the highest accuracy. This suggests that a well-designed distillation strategy can enhance the network's performance while reducing the demand for computational resources.

D. Comparison with Advanced Methods

HCSFNet, as introduced in this paper, demonstrates exceptional performance compared with advanced methods in the field of multimodal emotion recognition. This performance comparison is detailed in Table VI, focusing on the V-A dimensions.

TABLE VI
PERFORMANCE COMPARISON ON DEAP DATASET (ACCURACY, %)

Method                        Mean    Valence   Arousal
3D CNN [52]                   87.44   88.49     87.97
CNN-LSTM [53]                 90.82   86.13     88.48
Multi-Column CNN [54]         90.01   90.65     90.33
Multi-Layer Perceptron [55]   91.10   91.02     91.06
CDCN [56]                     92.24   92.92     92.58
GANSER [57]                   93.52   94.21     93.87
MLP + CNN [58]                93.53   94.33     93.93
VGG16 + LSTM [59]             94.43   94.85     94.64
SGMCL [15]                    94.72   95.68     95.20
Ours                          97.78   97.20     98.36

The 3D CNN achieved accuracy rates of 87.44% for Valence and 88.49% for Arousal, resulting in an average accuracy of 87.97%, which is a strong performance in emotion recognition. More advanced methods, such as CNN-LSTM, the Multi-Column CNN, and the Multi-Layer Perceptron, surpassed 90% in average accuracy, showcasing their potential advantages. Advanced techniques like CDCN and GANSER exceeded 92% in average accuracy, with some even surpassing 93%, indicating a high level of accuracy in emotion recognition tasks. The proposed HCSFNet outperforms all these methods in all dimensions, achieving an accuracy of 97.20% in Valence and 98.36% in Arousal, resulting in an average accuracy of 97.78%. These results demonstrate that HCSFNet reaches a state-of-the-art level in DEAP emotion recognition tasks.

To further validate the generalization capability of HCSFNet on other datasets, experiments were conducted on the MAHNOB-HCI dataset, and the results were compared with existing studies. As shown in Table VII, the average accuracy of MGIF is 61.05% and that of UBVMT is 46.03%. HCSFNet achieved an average accuracy of 60.59% on the MAHNOB-HCI dataset, comparable to the best existing research results. This demonstrates that the proposed HCSFNet has good generalization ability.

TABLE VII
PERFORMANCE COMPARISON ON MAHNOB-HCI DATASET (ACCURACY, %)

Method              Mean    Valence   Arousal
MGIF [60]           61.05   66.90     55.20
UBVMT [61]          46.03   42.91     49.14
HyperFuseNet [27]   42.93   44.30     41.56
Ours                60.59   59.72     61.46


The state-of-the-art performance achieved by HCSFNet and its outstanding computational efficiency are attributed to its innovative cross-modal feature fusion attention mechanism, the application of self-distillation strategies, and effective optimization of the network structure. By efficiently extracting EEG and video modality features using 1D and 3D convolutions, complemented by the HCCA module, the interaction and fusion of information between modalities are significantly enhanced. This strengthens the network's ability to capture key information across different spatial and temporal scales.

Although HCSFNet has demonstrated excellent performance on the DEAP and MAHNOB-HCI datasets, these datasets are primarily derived from adults, which limits the applicability to other age groups. Cross-modal attention mechanisms may exhibit biases when dealing with data from different cultural backgrounds, and the limited size and number of subjects in the relevant datasets further constrain applicability. Therefore, further research is needed to enhance the generalization and scalability of HCSFNet. Future work will involve training HCSFNet with datasets from different age and cultural backgrounds, employing data augmentation and balanced sampling, and incorporating fairness constraints to improve generalization capabilities. Additionally, domain adaptation will be explored to enhance the scalability of HCSFNet.

V. CONCLUSION

In this study, HCSFNet, a novel network for multimodal emotion recognition, is proposed. Initially, PMNet and VMNet are developed for EEG feature extraction through 1D convolution and video feature extraction through 3D convolution, respectively. Subsequently, HCCA is designed for capturing correlation information and integrating features across modalities, while MSPP is implemented for extracting key information from complex features. Furthermore, self-distillation is introduced in the training process of HCSFNet, resulting in a smaller model with higher accuracy, offering significant benefits in environments with constrained computing resources. Experimental results suggest that HCSFNet excels at capturing information from different modalities, significantly improving emotion recognition performance and providing a novel solution for multimodal emotion recognition. This study indicates that the proposed network has immense potential in areas such as mental health monitoring and driving safety. However, the generalization limitation of the network may restrict its effectiveness. How to further enhance the generalization capability of the proposed network to meet these demands will be a focal point of subsequent research.

REFERENCES

[1] F. Wang, S. Wu, W. Zhang, Z. Xu, Y. Zhang, C. Wu, and S. Coleman, "Emotion recognition with convolutional neural network and eeg-based efdms," Neuropsychologia, vol. 146, p. 107506, 2020.
[2] F. Najar and N. Bouguila, "Smoothed generalized dirichlet: A novel count-data model for detecting emotional states," IEEE Transactions on Artificial Intelligence, vol. 3, no. 5, pp. 685–698, 2022.
[3] Y. Zhang, J. Chen, J. H. Tan, Y. Chen, Y. Chen, D. Li, L. Yang, J. Su, X. Huang, and W. Che, "An investigation of deep learning models for eeg-based emotion recognition," Frontiers in Neuroscience, vol. 14, p. 622759, 2020.
[4] G. Assunção, B. Patrão, M. Castelo-Branco, and P. Menezes, "An overview of emotion in artificial intelligence," IEEE Transactions on Artificial Intelligence, vol. 3, no. 6, pp. 867–886, 2022.
[5] G. Yu, "Emotion monitoring for preschool children based on face recognition and emotion recognition algorithms," Complexity, vol. 2021, pp. 1–12, 2021.
[6] A. Abdulrahman and M. Baykara, "A comprehensive review for emotion detection based on eeg signals: Challenges, applications, and open issues," Traitement du Signal, vol. 38, no. 4, 2021.
[7] M. Khateeb, S. M. Anwar, and M. Alnowami, "Multi-domain feature fusion for emotion classification using deap dataset," IEEE Access, vol. 9, pp. 12134–12142, 2021.
[8] W. Hu, G. Huang, L. Li, L. Zhang, Z. Zhang, and Z. Liang, "Video-triggered eeg-emotion public databases and current methods: a survey," Brain Science Advances, vol. 6, no. 3, pp. 255–287, 2020.
[9] Z. Zhang, G. Chen, and S. Chen, "A support vector neural network for p300 eeg signal classification," IEEE Transactions on Artificial Intelligence, vol. 3, no. 2, pp. 309–321, 2022.
[10] J. Chen, T. Ro, and Z. Zhu, "Emotion recognition with audio, video, eeg, and emg: a dataset and baseline approaches," IEEE Access, vol. 10, pp. 13229–13242, 2022.
[11] M. Wu, W. Teng, C. Fan, S. Pei, P. Li, and Z. Lv, "An investigation of olfactory-enhanced video on eeg-based emotion recognition," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 31, pp. 1602–1613, 2023.
[12] S. Poria, E. Cambria, R. Bajpai, and A. Hussain, "A review of affective computing: From unimodal analysis to multimodal fusion," Information Fusion, vol. 37, pp. 98–125, 2017.
[13] S. M. S. A. Abdullah, S. Y. A. Ameen, M. A. Sadeeq, and S. Zeebaree, "Multimodal emotion recognition using deep learning," Journal of Applied Science and Technology Trends, vol. 2, no. 01, pp. 73–79, 2021.
[14] N. Ahmed, Z. Al Aghbari, and S. Girija, "A systematic survey on multimodal emotion recognition using learning algorithms," Intelligent Systems with Applications, vol. 17, p. 200171, 2023.
[15] H. Kan, J. Yu, J. Huang, Z. Liu, H. Wang, and H. Zhou, "Self-supervised group meiosis contrastive learning for EEG-based emotion recognition," Applied Intelligence, vol. 53, pp. 27207–27225, 2023.
[16] M. A. Ozdemir, M. Degirmenci, E. Izci, and A. Akan, "Eeg-based emotion recognition with deep convolutional neural networks," Biomedical Engineering/Biomedizinische Technik, vol. 66, no. 1, pp. 43–57, 2021.
[17] Y. Yin, X. Zheng, B. Hu, Y. Zhang, and X. Cui, "Eeg emotion recognition using fusion model of graph convolutional neural networks and lstm," Applied Soft Computing, vol. 100, p. 106954, 2021.
[18] A. S. Rajpoot, M. R. Panicker, et al., "Subject independent emotion recognition using eeg signals employing attention driven neural networks," Biomedical Signal Processing and Control, vol. 75, p. 103547, 2022.
[19] J.-Y. Guo, Q. Cai, J.-P. An, P.-Y. Chen, C. Ma, J.-H. Wan, and Z.-K. Gao, "A transformer based neural network for emotion recognition and visualizations of crucial eeg channels," Physica A: Statistical Mechanics and its Applications, vol. 603, p. 127700, 2022.
[20] K. Ezzameli and H. Mahersia, "Emotion recognition from unimodal to multimodal analysis: A review," Information Fusion, p. 101847, 2023.
[21] Y. Zhang, C. Cheng, and Y. Zhang, "Multimodal emotion recognition based on manifold learning and convolution neural network," Multimedia Tools and Applications, vol. 81, no. 23, pp. 33253–33268, 2022.
[22] Y. Zhang, C. Cheng, S. Wang, and T. Xia, "Emotion recognition using heterogeneous convolutional neural networks combined with multimodal factorized bilinear pooling," Biomedical Signal Processing and Control, vol. 77, p. 103877, 2022.
[23] S. Chen, J. Tang, L. Zhu, and W. Kong, "A multi-stage dynamical fusion network for multimodal emotion recognition," Cognitive Neurodynamics, vol. 17, no. 3, pp. 671–680, 2023.
[24] Y. Wang, S. Qiu, D. Li, C. Du, B.-L. Lu, and H. He, "Multi-modal domain adaptation variational autoencoder for eeg-based emotion recognition," IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 9, pp. 1612–1626, 2022.
[25] B. Fu, C. Gu, M. Fu, Y. Xia, and Y. Liu, "A novel feature fusion network for multimodal emotion recognition from eeg and eye movement signals," Frontiers in Neuroscience, vol. 17, 2023.
[26] Y. Wu and J. Li, "Multi-modal emotion identification fusing facial expression and eeg," Multimedia Tools and Applications, vol. 82, no. 7, pp. 10901–10919, 2023.

