Comparative Analysis of Machine Learning Approaches for Emotion Recognition Using EEG and ECG Signals
Abstract—Emotions significantly influence human behaviour and decision-making, particularly in a digital era dominated by human-computer interactions (HCIs). Emotion can be expressed in various forms, including facial expressions, textual descriptions, and physiological responses. The main objective of this study is to comparatively analyze the performance of various machine learning (ML) classifiers to accurately recognize human emotional states using electroencephalogram (EEG) and electrocardiogram (ECG) signals. This study uses the DREAMER dataset and classifies emotional state in four different ways according to valence, arousal, and dominance (VAD) values – binary emotions, positive-neutral-negative (PNN) emotions, a two-dimensional valence-arousal emotional space, and a three-dimensional VAD emotional space. An ML pipeline has been developed to detect human emotions with EEG and ECG signals. Without removing outliers and balancing the dataset, the classifier that achieved the best performance was the ensemble classifier (SVM + random forest). If emotion is defined as a binary state, our experimental results show that both the SVM and the ensemble classifiers strike a good performance with approximately 80% accuracy; however, they perform poorly with the non-binary emotional models. The multinomial logistic regression (MLR) classifier and the random forest (RF) classifier consistently achieve a good performance for both the binary and the non-binary emotion models with 80%-90% accuracy; their accuracy is higher than the accuracy reported in the original DREAMER experimental results. Our study experimentally confirms this finding.

Index Terms—DREAMER, Machine Learning, Emotion Recognition

I. INTRODUCTION

Humans can verbally or non-verbally express and communicate their feelings, thoughts, and moods. In other words, emotions can be expressed verbally through text or words; alternatively, they can be expressed through non-verbal cues like voice, facial expressions, behaviors, and body gestures/postures. Such a complicated psychophysiological phenomenon, triggered by either conscious or unconscious perception of external factors/relations like an object, an environment, a stimulus, or a scenario/situation, is recognized as a human affect [1].

Without human affect, an Artificial Intelligence (AI) or AI-based system has a communicative gap between itself and its human collaborators. In human-computer interaction, computers often struggle to accurately recognize or respond to human emotional states. Consequently, lacking the ability to understand the semantics of contextual information, an intelligent system cannot make a proper decision, take appropriate actions, or produce empathy with the collected information. To bridge this communicative gap and to enhance the intelligence of an AI-based machine, developing and building an advanced emotion-aware interactive system is crucial in the intelligent HCI field.

Emotion has a significant impact on our lives, especially in the digital age. Emotion recognition tools are in high demand, as people interact with various forms of online content, and social media plays a central role in triggering emotional responses [2]. Human emotions encompass both physiological and physical dimensions. The physical aspects, including facial expressions and eye movements, are observable indicators of emotional states. These physical signs are often preferred in research due to their objectivity and ease of collection. On the physiological side, emotions can lead to changes in heart rates and hormone levels; however, their signal analysis can be complex, and it is quite challenging for researchers to collect them for analysis because of the General Data Protection Regulation (GDPR). The main attraction of analyzing emotions with physiological signals is that they are objective: they effectively reveal emotional responses during testing, which makes them useful for mental healthcare diagnosis [3].

Research in human-centric AI for emotional intelligence has gained tremendous attention in the literature, and various approaches and frameworks have been proposed [4]. The literature has shown that the performance of machine learning (ML) classifiers is affected by how emotion is defined when using a single physiological signal [5], [6], but it is not clear what the relationship is between the way emotion is defined and the performance of ML classifiers with multiple physiological signals [4], [7], [8]. Thus far, we have not found any studies that comparatively analyzed the performance of various ML classifiers to accurately
6867
979-8-3503-8622-6/24/$31.00 ©2024 IEEE
Authorized licensed use limited to: Sungkyunkwan University. Downloaded on March 28,2025 at 13:57:23 UTC from IEEE Xplore. Restrictions apply.
detect human emotional states using multiple physiological signals for different definitions of emotional states. Addressing this issue, we aim to comparatively analyze the performance of various ML classifiers to accurately detect human emotional states using multiple physiological signals.

A literature review on affective computing, emotional states, multimodal emotion datasets, and ML frameworks is outlined in Section II. A fundamental ML pipeline to detect human emotions with multiple physiological signals is discussed in Section III. The evaluation of the performance of the ML pipeline is analyzed in Section IV. The research findings are concluded in Section V.

II. LITERATURE REVIEW

Several survey studies on multimodal emotion recognition and emotion analysis have recently been conducted and published [9]–[12]. In this section, we mainly focus on existing studies related to emotion recognition with physiological signals, emotion models, and some multimodal emotion databases related to physiological cues.

A. Emotion Recognition with Physiological Signals

Physiological signals are biological signals that provide information about the internal state of an organism. They can be used to monitor a person's health, support diagnosis, and understand their emotions, and they can provide a more accurate and objective indication of an individual's emotional state. The challenge of these physiological signals is the difficulty of collecting them in real time. Changes in one's emotions are induced by environmental factors, and these factors are usually accompanied by changes in physiological and behavioral signals. Behavioral signals are the external manifestations triggered by emotions, while physiological signals are the internal form of expression that reflects emotions. Prianka categorizes modalities for recognizing emotions into three groups: physiological, behavioral, and brain signals [9]. Many studies have reviewed emotion recognition using physiological signals. Shu et al. [10] comprehensively reviewed the published emotional physiological datasets and the frameworks for emotion recognition based on physiological signals. Wijasena et al. [11] provided a concise review of the analysis of physiological signals, emotion models, and emotion recognition with physiological signals in portable devices. Joy et al. [12] reviewed recent advances in emotion recognition with physiological signals, including preprocessing, methodologies for feature extraction, and classification.

B. Emotion Models

Ekman identified six basic emotions: anger, disgust, fear, happiness, sadness, and surprise (Fig. 1). In Ekman's basic emotion model, emotions are fundamental, shared across races and cultures, and can be universally understood through specific criteria. This model serves as a valuable foundation for detecting facial emotions. The emotions in this model can also be found in non-basic compound emotional states; that is, compound states like guilt, contempt, sarcasm, frustration, and loneliness can be directly derived from these basic emotional states. Yet, some concerns have been raised about whether Ekman's model encapsulates the entire range of human emotions. Additionally, this model focuses on emotions prevalent in Western cultures, and it fails to model continuous emotional changes in real time [13].

To overcome this challenge, a continuous emotional model was proposed in the psychology community. In such a model, a dimension represents a characteristic of human affect, and a point in a multi-dimensional emotional space represents an emotional state. Such a multi-dimensional emotion model captures the rich and complex nature of emotions and allows for a deeper examination of emotional states. Based on the principle of a multi-dimensional emotional model, the Pleasure-Arousal-Dominance (PAD) emotional state model [14] is the most recognizable model in the literature. Like the PAD model, a two-dimensional valence-arousal emotional space (see Fig. 2) is defined in [15]. In this two-dimensional valence-arousal space, valence refers to the actual emotion and arousal denotes the intensity of the emotion; the four quadrants in this model are formed based on the combination of high/low valence and high/low arousal. This two-dimensional valence-arousal space is further extended into a three-dimensional valence-arousal-dominance (VAD) emotional space (see Fig. 3), where dominance denotes the potency or the level of control a person has over a specific emotion; it ranges from submissive to dominant.

Fig. 1. Ekman's six basic emotions [17]
With the advancement of technology, in-the-wild data is becoming increasingly important. Controlled data can still be useful for developing and evaluating emotion recognition methods, but it is important to be aware of its limitations. Multimodal systems that combine information from both audio and video are promising for emotion recognition in the wild [18].

Natural elicitation is the most common method of eliciting emotional data. It traditionally involves eliciting emotions, for instance, by asking participants to watch a video clip or read a story. The lack of natural elicitation of emotion during the development of simulated emotion databases is one of the key issues addressed by Ayadi et al. in their comprehensive review [19]. They note that many simulated emotion databases are developed by recording emotionally neutral prompts spoken in diverse emotions by professional or amateur speakers [20].

Spontaneous elicitation is another type of speech emotion database, created by recording people speaking in real-world situations. This kind of database is more realistic than simulated full-blown emotion speech databases because the emotions are more likely to be genuine. However, spontaneous databases can be more difficult to analyze because emotions may not be clearly expressed [21].

Induced elicitation collects recordings of people expressing a specific emotion induced by stimuli, for instance music, while measuring responses such as breathing patterns. Such recordings have been elicited with a variety of methods, including watching emotionally evocative videos, reading scripts, and listening to music [22]. The self-reported valence and arousal values were significantly different for the diverse emotion conditions.

Fig. 2. The two-dimensional valence-arousal emotional space [15]

Fig. 3. The three-dimensional VAD emotional space [16]

The DREAMER dataset [15] is a popular bimodal physiological database used in human emotion recognition. A participant's affective state was elicited by audio and visual stimuli in a controlled environment; thus, this is an induced elicitation. This database consists of electroencephalogram (EEG) and electrocardiogram (ECG) signals recorded directly from 23 participants aged between 22 and 33 years old, whose affect was induced by watching 18 different video clips in an isolated environment. The EEG signal was captured with 14 electrodes. Each affective state was measured in terms of valence, arousal, and dominance (VAD), with values ranging between 1 and 5. According to the VAD rating scale, the neutral cutoff value is 3. In this way, three separate binary classes were defined – high/low valence, high/low arousal, and high/low dominance. Both EEG and ECG data are provided as a raw, imbalanced dataset without any preprocessing. Both signals were extracted and pre-processed with MATLAB. After preprocessing and feature extraction, there were 42 EEG-based features, 71 ECG-based features, and 414 samples in the dataset. All outliers remained, and all features were normalized by dividing them by the corresponding baseline features.

Soleymani et al. [23] proposed the MAHNOB-HCI database and evaluated the performance of SVM with this database. This imbalanced database encompasses features based on eye gaze data, EEG, and peripheral physiological signals. The emotions were elicited in a controlled environment; the elicitation of this database is both induced and spontaneous.

Koelstra et al. [24] proposed the DEAP database. The EEG, ECG, and peripheral physiological signals in this database were collected from 32 participants watching 40 one-minute-long excerpts of music videos in a controlled environment. The elicitation of this database is also both induced and spontaneous. They employed this database to evaluate the performance of the Naïve Bayes classifier.

In summary, various methodologies, open databases, and frameworks for detecting emotions have been systematically reviewed and intensively studied in the literature. The existing studies and reviews mainly focus on proposing and integrating innovative approaches to enhance the performance of existing frameworks. To our knowledge, there has been no systematic empirical study to find the best practices for benchmarking ML classifiers for multimodal emotion recognition.

III. METHODOLOGY

To conduct a comparative study, an ML pipeline was developed to detect human emotions with multiple physiological
signals. The components of this pipeline are described in the following sections.

A. Bi-modal Emotion Detection Replicator

To gain a better understanding of the innate correlations among features and modalities in the DREAMER dataset, we first built a simple conventional ML-based bi-modal affect recognition framework to replicate the experiments conducted with the original DREAMER dataset [15] (see Fig. 4) described in the previous section. This simple framework was implemented in Python with the PyCM package. The pipeline includes the following processing stages – preprocess raw signals, remove anomalies, remove outliers, balance the dataset, normalize features, and transform the emotion classification.

Frequencies outside each rhythm's band are attenuated. According to the calculated bands for each rhythm, the overall band for EEG is [0.0625, 0.46875] (i.e. the band ranges between 4 and 30 Hz). Therefore, the Gamma rhythm is not applicable in this study. As the lowest and highest bands for the Delta rhythm are below 0, the Delta rhythm can be neglected. According to Discrete Fourier Transform (DFT) theory [26], the number of taps can be calculated by dividing the frequency by the frequency resolution. The optimal number of taps for the filter is 9, and the best overall frequency resolution is 14.22 Hz. The power spectral density for the theta rhythm is 21763.552, for the alpha rhythm 18158.231, and for the beta rhythm 2043.311. As the magnitude of the features extracted from the pre-processed EEG signal varies significantly and different types of features are measured in different units, it is necessary to normalize those features. Each pre-processed EEG feature is divided by the corresponding baseline feature.

TABLE I
BANDS FOR THE RHYTHMS RELATED TO EMOTIONS
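The rhythm-band power computation described above can be sketched with a plain DFT-based band-power estimate. This is a hedged illustration only: the paper's exact bands belong in Table I, so the 4-8, 8-13, and 13-30 Hz limits below are common conventions consistent with the overall 4-30 Hz band stated in the text, and the 128 Hz sampling rate and the `band_power` helper are assumptions for demonstration.

```python
import numpy as np

def band_power(signal, fs, f_lo, f_hi):
    """Power of `signal` within [f_lo, f_hi) Hz, via the discrete Fourier transform."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    mask = (freqs >= f_lo) & (freqs < f_hi)
    return np.sum(np.abs(spec[mask]) ** 2) / len(signal)

fs = 128  # assumed sampling rate (Hz)
t = np.arange(0, 4, 1 / fs)
# Synthetic signal: a strong 6 Hz (theta) and a weaker 10 Hz (alpha) component.
x = 2.0 * np.sin(2 * np.pi * 6 * t) + 1.0 * np.sin(2 * np.pi * 10 * t)

theta = band_power(x, fs, 4, 8)    # theta rhythm: 4-8 Hz (common convention)
alpha = band_power(x, fs, 8, 13)   # alpha rhythm: 8-13 Hz
beta = band_power(x, fs, 13, 30)   # beta rhythm: 13-30 Hz
print(theta > alpha > beta)  # the 6 Hz component dominates -> True
```

In the actual pipeline each such band-power feature would then be divided by its baseline counterpart, as described above.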
These additional features determine the intervals of those PQRST waveforms related to emotional states. Some of these ECG features could contain missing values and infinity values. To remove those anomalies, we chose to impute them by interpolation. There are three common interpolation methods – linear interpolation, polynomial interpolation, and padding interpolation. Among these three methods, linear interpolation introduces some error; it is simply polynomial interpolation of degree 1. Higher-degree polynomial interpolation is a more precise approach for time-series data and provides more accurate results than linear interpolation. In this experiment, we adopt both linear interpolation and polynomial interpolation of degree 3. The threshold we chose for removing anomalies is 30%.

D. Outlier Removal

The original DREAMER dataset contains some outliers, and they were not identified or cleaned up in the experiments conducted with the original DREAMER dataset. Outliers can impair the performance of a classifier; thus, it is crucial to identify the presence of any outliers in a dataset before training a classifier. In this experiment, we calculated the z-score for each data point. According to the z-scores, there are 39 EEG features and 33 ECG features containing outliers in the dataset. In this study, we chose to apply an imputation technique to clean up all outliers. Instead of imputing extreme values, we winsorized the outliers, replacing them with the nearest non-outlier values. This ensures that the extreme values are still accounted for in the analysis while their impact is kept to a minimum. The proportion of values winsorized at the lower end is 0.06, and at the upper end 0.23.

E. Balancing Dataset

As mentioned in the original DREAMER study, the dataset is balanced for dominance, imbalanced for arousal, and more imbalanced for valence [15]. For the original dataset subjected to EEG, there are 253 samples classified as positive emotion and 161 samples classified as negative emotion. An imbalanced dataset can also degrade the performance of a classifier. There are three popular techniques that we can apply to balance a dataset – oversampling, under-sampling, and a hybrid of the two. The oversampling technique is simple; however, it only suits a small dataset. The under-sampling technique can be biased by the choice of the majority class. After intensive experimentation with both techniques, we found that neither approach can successfully balance the dataset on its own. Therefore, we combined the over-sampling technique with the under-sampling technique. Specifically, we apply the combination of the Synthetic Minority Over-sampling Technique (SMOTE) and Tomek links to balance the dataset if the emotional states are defined as the standard binary emotions (i.e. positive emotion and negative emotion); we apply the SMOTE technique alone if the emotional states are defined as non-binary emotions. However, we also experimented with the combination of SMOTE and NearMiss version one. NearMiss version one is a simple under-sampling technique that samples the majority class by minimum average distance to the three closest minority class samples. After intensive experiments, we found that NearMiss version one is the most suitable under-sampling technique to combine with SMOTE to balance the DREAMER dataset.

F. Feature Normalization and Emotional State Transformation

In the original DREAMER dataset, each participant's emotion is defined in terms of VAD on a 5-point rating scale [15]. If the emotional state is defined as a standard binary emotional state (i.e. positive and negative emotions), the original 5-point rating scale can remain unchanged. Even when we take the neutral state into account in the final emotional state, we still do not need to change the 5-point rating scale. However, if we define an emotional state in a three-dimensional VAD space, we need to transform the original VAD scale to a scale between -1 and 1. To enhance the performance of a classifier, we usually normalize the dataset so that all features are on the same scale. There are three common normalization techniques – min-max scaling, standardization, and robust scaling. Robust scaling can only be applied if outliers remain; thus, we cannot apply robust scaling once all outliers are removed. If we applied standardization or robust scaling, we would need to manually rescale the VAD values to [-1, 1]. For simplicity, we directly employed min-max scaling to normalize the dataset and simultaneously transform the VAD values.

G. Defined Emotional State

The simplest emotional state we can define is a binary emotion. If the VAD values are transformed into [-1, 1], the emotion is positive if and only if the value of valence is greater than or equal to 0; otherwise, it is negative. This binary emotion is extended into a PNN emotion: the emotion is positive if and only if the valence is greater than 0; neutral if the valence is 0; and negative if the valence is negative. The two-dimensional valence-arousal emotional space and the three-dimensional VAD space discussed in the literature review are also utilized in this study.

H. Classifiers and Parameter Settings

This framework supports the following classifiers: random, SVM, LDA, KNN, Naïve Bayes, RF, decision tree (DT), an ensemble classifier combining SVM and RF, and an ensemble classifier combining SVM and DT. These classifiers are employed in various experiments. The seed of the random generator is 42. The parameter settings of each classifier in our framework are presented in Table II.
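The z-score screening and winsorization described in Section D can be sketched as follows. The `zscore_outlier_mask` and `winsorize` helpers are illustrative assumptions; the 0.06/0.23 winsorization proportions come from the text, while the z-score threshold is an assumed value (the study does not state one).

```python
import numpy as np

def zscore_outlier_mask(x, z_thresh=3.0):
    """Boolean mask of points whose |z-score| exceeds z_thresh."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > z_thresh

def winsorize(x, lower=0.06, upper=0.23):
    """Clamp the lowest `lower` and highest `upper` fraction of values
    to the nearest remaining (non-outlier) values."""
    lo = np.quantile(x, lower)
    hi = np.quantile(x, 1.0 - upper)
    return np.clip(x, lo, hi)

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])  # 100.0 is an extreme value
print(zscore_outlier_mask(x, z_thresh=2.0))  # a looser threshold for this tiny sample
w = winsorize(x)
print(w.max() <= 4.0)  # the extreme value has been clamped -> True
```

Winsorizing (rather than deleting) keeps the sample count at 414 while bounding the influence of the extremes.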
6871
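The over-sampling side of the balancing strategy in Section E can be illustrated with a minimal from-scratch SMOTE sketch. In practice a library such as imbalanced-learn would be used; the `smote_oversample` helper and the feature dimension of 4 are assumptions, while the 253/161 class counts and the seed of 42 come from the text.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed 42, matching the pipeline's random seed

def smote_oversample(X_min, n_new, k=5):
    """Generate n_new synthetic minority samples by interpolating each chosen
    sample toward one of its k nearest minority neighbours (the core idea of
    SMOTE; a simplified sketch, not the imbalanced-learn implementation)."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1 : k + 1]  # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# 161 negative vs. 253 positive samples, as in the DREAMER EEG labels:
X_neg = rng.normal(size=(161, 4))
X_new = smote_oversample(X_neg, n_new=253 - 161)
print(X_new.shape)  # (92, 4)
```

A hybrid scheme then follows this with an under-sampling pass (Tomek links or NearMiss version one) over the majority class.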
Authorized licensed use limited to: Sungkyunkwan University. Downloaded on March 28,2025 at 13:57:23 UTC from IEEE Xplore. Restrictions apply.
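The rating transformation and label definitions of Sections F and G can be sketched as below. The helper names are assumptions for illustration; the min-max mapping from the 5-point scale to [-1, 1] and the sign rules for the binary and PNN labels follow the text.

```python
def minmax_to_unit_interval(r, lo=1, hi=5):
    """Rescale a 5-point VAD rating from [lo, hi] to [-1, 1] (min-max)."""
    return 2 * (r - lo) / (hi - lo) - 1

def binary_emotion(valence):
    """Binary emotion: positive iff the transformed valence is >= 0."""
    return "positive" if valence >= 0 else "negative"

def pnn_emotion(valence):
    """Positive-neutral-negative (PNN) emotion from the transformed valence."""
    if valence > 0:
        return "positive"
    if valence == 0:
        return "neutral"
    return "negative"

v = minmax_to_unit_interval(3)  # the neutral cutoff rating of 3 maps to 0.0
print(v, binary_emotion(v), pnn_emotion(v))  # 0.0 positive neutral
```

Note how the cutoff rating of 3 lands exactly on 0 after the transform, which is why the binary and PNN definitions disagree only at that single point.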
TABLE II
THE PARAMETER SETTINGS OF EACH CLASSIFIER

Classifier | Parameter Settings
SVM | kernel = {rbf, linear}, C = {1 × 10−4, 1 × 10−3, 0.1, 0.9}, γ = {1 × 10− , 5 × 10− , 1 × 10− }
LDA | solver = {lsqr, eigen}, shrinkage = {0.9, 0.95, 0.995, 0.999, 1.0}
KNN | K = {3, 5, 7, 9}, metric = minkowski, p = 5.5, weight = {uniform, distance}, algorithm = {brute, kd tree, ball tree}, leaf size = {1, 2, 3, 4, 5, 6, 10, 16}
Gaussian Naïve Bayes | var smooth = {2, 4, 4.5, 6, 7}
Bernoulli Naïve Bayes | α = {0.1, 0.5, 0.9}, binarize = {0.1, 0.5, 0.9}
RF | min sample leaf = {1, 2}, number of trees = {1, 2}, max depth = {1, 2}, max leaf = 2, class weight = {balanced, balanced subsample}
Logistic Regression | solver = {saga, newton-cg, newton-cholesky, lbfgs, liblinear, sag}, C = 1 × 10−3, n jobs = -1, penalty = {none, 'l2'}
DT | splitter = 'random', max depth = {1, 2}, min sample leaf = {1, 2}, min weight fraction leaf = {0.1, 0.01}, max leaf nodes = 2

TABLE III
THE COMPARISON BETWEEN THE ORIGINAL DREAMER EXPERIMENTS AND OUR ML PIPELINE EXPERIMENT

Setting | DREAMER Experiments | ML Pipeline Experiments
Language | MATLAB | Python
EEG Features | 42 features | 42 features
ECG Features | 71 features | 91 features
Sample size after preprocessing | 414 | 414
Preprocess missing values and infinities | Unknown | EEG: 42 features, ECG: 83 features, EEG + ECG: 125 features
Balanced | Imbalanced | Balanced
Outliers | Contain | Removed
Type of Split | Split based on video ID | Baseline, stratified by video ID, group by video ID
Number of Iterations | 10 | 10, 100, 1000
Classifiers | SVM (kernel = rbf, linear), KNN (k = 3, 5, 7), LDA, random | SVM (kernel = rbf, linear), KNN (k = 3, 5, 7, 9), LDA, random, logistic regression, Gaussian Naïve Bayes, Bernoulli Naïve Bayes, DT, RF, ensemble classifier
Classifier Parameters | Unknown | see Table II

IV. RESULTS AND DISCUSSION
In the original DREAMER experiments, the dataset was split in a 90:10 ratio between the training set and the test set. To be consistent with the settings of the original DREAMER experiments, we also used 90% of the dataset for training the ML classifiers and 10% of the dataset for testing. In our pipeline, there are three ways to split the dataset – the standard split, stratifying the dataset based on video ID, and grouping the dataset based on video ID.

Besides the average accuracy, the metrics for measuring the performance of a classifier include the average recall, the average AUC score, the average F1 score, the average specificity, and the average precision. The original DREAMER experimental results are used as a benchmark against which to compare our results.

Firstly, we replicated the original DREAMER experiments with our pipeline by removing the steps for removing outliers, balancing data, and normalization. This experimental result serves as a benchmark for the Python implementation. Secondly, we experimented with the complete pipeline with various defined emotional states. The comparison of the experimental settings between the original DREAMER experiments and our ML pipeline experiment is summarized in Table III.

A. ML Pipeline without Balancing and Outlier Removal Results

To align with the original implementation of the DREAMER experiments, we removed the following standard steps from our framework – feature normalization and emotion transformation, outlier detection and removal, and dataset balancing. In this experiment, the classifiers we investigated are SVM, LDA, KNN, Naïve Bayes, RF, DT, the ensemble classifier with SVM + RF, and the ensemble classifier with SVM + DT. Standard 10-fold cross-validation was repeated 10 times for each type of split and each ML classifier. Except for the ensemble classifiers and the random classifier, the best mean accuracy of the classifiers supported in this pipeline is 69%, and the lowest mean accuracy of all classifiers is 54%. There are no differences among the three types of splits; in this study, we focus on the results of the stratified split. For the random classifier, the mean accuracy is 50% with a standard split, 54% with a stratified split, and 57% with a group split. For the ensemble classifier, the best average accuracy is 74%. Compared to the mean accuracy of the benchmark classifier SVM (62.49%), the best mean accuracy of our classifiers has increased by 6.51%, and the best average accuracy of the ensemble classifiers has increased by 11.51%. The classifiers vary in the other metrics. The results for all classifiers with the stratified split are shown in Table IV. Compared to the mean F1 score of the SVM classifier in the benchmark (valence = 53.05%, arousal = 57.98%, dominance = 61.71%), the average F1 score of the corresponding SVM classifier in our experiment has increased by about 0.08. The ensemble classifier obtains the highest mean F1 score and Gaussian Naïve Bayes the lowest. In terms of average precision, DT, SVM, and logistic regression achieve the highest mean precision and LDA the lowest. This means that DT, SVM, and logistic regression can correctly recognize a binary valence/arousal/dominance around 83% of the time,
yet LDA can only manage to correctly recognize a binary valence/arousal/dominance around 66% of the time. Both the ensemble classifier and RF have the highest mean recall value, the highest mean specificity value (77%), and the highest mean AUC score; however, Gaussian Naïve Bayes has the lowest mean recall, the lowest specificity, and the lowest AUC score. The worst classifier for this system is Gaussian Naïve Bayes. The ensemble classifier achieves the best accuracy together with high specificity and precision; its F1 score demonstrates that this classifier strikes a fairly good balance between high precision and high recall.

TABLE IV
THE PERFORMANCE OF CLASSIFIERS WITHOUT OUTLIER REMOVAL AND BALANCING

Classifier | Avg Acc (%) | AUC | Avg Precision (%) | Avg Recall (%) | Avg F1 Score (%) | Avg Specificity (%)
Random | 54.05 | 0.47 | 48 | 47 | 47.41 | 47
LDA | 69.05 | 0.66 | 66 | 66 | 67 | 66
RF | 69.05 | 0.74 | 74 | 74 | 69.03 | 74
DT | 69.05 | 0.62 | 83 | 62 | 58.73 | 62
Gaussian Naïve Bayes | 69.05 | 0.5 | 69 | 50 | 40.85 | 50
Logistic Regression | 69.05 | 0.66 | 82 | 66 | 62.98 | 66
Bernoulli Naïve Bayes | 69.05 | 0.68 | 73 | 68 | 66.77 | 68
SVM + DT | 73.81 | 0.77 | 75 | 77 | 73.43 | 77
SVM + RF | 73.81 | 0.77 | 78 | 77 | 73.79 | 77
SVM | 69.05 | 0.64 | 82 | 64 | 61.08 | 64
3NN | 69.05 | 0.69 | 69 | 69 | 68.89 | 69
5NN | 69.05 | 0.69 | 71 | 69 | 68.16 | 69
7NN | 69.05 | 0.67 | 68 | 67 | 67.56 | 67
9NN | 69.05 | 0.69 | 68 | 69 | 68.16 | 69

B. Complete ML Pipeline for Binary and Non-binary Emotions

In light of the strong performance demonstrated by the RF and ensemble classifiers in the previous experiment, in the next set of experiments we focused on investigating the performance of the following classifiers – KNN, RF, the ensemble classifier, and MLR. Standard 10-fold cross-validation was repeated 1000 times for each type of split and each ML classifier. Both the ensemble classifier (SVM with a linear kernel and C = 0.001 + RF with its parameter values set to 5) and KNN with k = 7 excel in binary emotional state recognition with a standard split, achieving commendable mean accuracies of 96.73% and 95.65% respectively, which outperform the mean accuracy of the original DREAMER experimental result (≈ 62%) (see Table V). In general, the ensemble classifier outperforms the other classifiers with the highest mean accuracy, highest mean recall, highest mean specificity, and highest mean F1 score. The KNN classifier with k = 3 has the lowest mean accuracy and the lowest mean recall, yet KNN still surpasses the accuracy of the original DREAMER experimental result once the dataset contains no outliers and is balanced. The mean F1 scores of all classifiers are higher than the mean F1 score of the original DREAMER experimental result (≈ 62%). Compared to the results for the pipeline without outlier removal and balancing, the mean accuracy of all classifiers increased by around 15%. This may be because of the removal of outliers and the balancing of the dataset. However, the FPR of SVM, RF, and MLR is still quite high, meaning that the proportion of samples incorrectly predicted as positive emotions is a bit high. The ensemble classifier is the most suitable classifier for detecting binary emotion. For the KNN classifier, the performance improves as the number of neighbors increases.

TABLE V
THE PERFORMANCE OF CLASSIFIERS USED TO DETECT BINARY EMOTIONS

Classifier | Avg Acc (%) | AUC (%) | Avg Precision (%) | Avg Recall (%) | Avg F1 Score (%) | Avg FPR (%)
3NN | 79.04 | 77.15 | 81.6 | 81.6 | 77.58 | 18.4
5NN | 79.04 | 77.15 | 81.6 | 81.6 | 77.58 | 18.4
7NN | 95.65 | 95.83 | 95.83 | 95.83 | 95.65 | 4.17
SVM | 93.69 | 94.97 | 93.41 | 93.41 | 93.48 | 6.59
RF | 89.39 | 87.39 | 92.25 | 88.51 | 89.87 | 7.75
MLR | 91.3 | 88.89 | 93.75 | 93.75 | 90.42 | 6.25
SVM + RF | 96.73 | 96.89 | 96.86 | 96.86 | 96.74 | 3.14

For non-binary emotional classifications, the MLR classifier reaches approximately 86% mean accuracy for PNN emotional states, with the lowest FPR (Table VI). The RF classifier and the MLR classifier deliver a smooth performance across all metrics (no metric with the lowest value); thus, these two classifiers are the most suitable for detecting the PNN emotional states. The SVM classifier has the lowest specificity (the highest FPR); thus, SVM is the least suitable classifier for detecting the PNN emotional states. Surprisingly, the performance of KNN is quite similar to its performance in the pipeline without outlier removal and balancing: as the number of neighbors increases, the performance gradually improves. The performance of the ensemble classifier deteriorates dramatically; according to the per-class accuracies, the accuracy of predicting neutral and positive emotions stays around 40% to 60%, while the accuracy for negative emotion stays between 78% and 95%.

For the valence-arousal emotion space including neutral emotions, both KNN with 3 neighbors and MLR classifiers achieve the highest accuracy, precision, recall, F1 score, and
specificity and the lowest FPR (see Table VII). Hence, they are the most suitable classifiers for detecting emotions in this valence-arousal emotional space. When the number of neighbors for the KNN classifier increases, its performance declines sharply. Both the SVM and the ensemble classifiers perform poorly: they record the lowest accuracy, precision, recall, and F1 score. The performance of SVM is as poor as that of the corresponding classifier used to detect the PNN emotional states.

TABLE VI
THE PERFORMANCE OF CLASSIFIERS USED TO DETECT PNN EMOTIONAL STATES

Classifier   Avg Acc (%)   Avg Precision (%)   Avg Recall (%)   Avg Specificity (%)   Avg F1 Score (%)   Avg FPR (%)
3NN          59.26         53.06               54.56            80.21                 52.96              19.79
5NN          62.17         58                  57.94            82.29                 56.76              17.71
7NN          68.65         60.95               63.51            85.03                 61.01              14.97
SVM          39.3          52.61               40.54            70.83                 33.21              29.17
RF           82.22         79.93               78.99            91.81                 78.41              8.19
MLR          86.26         83.38               83.01            93.69                 82.45              6.31
SVM + RF     58.91         72.53               63.47            81.77                 59.13              18.23

TABLE VII
THE PERFORMANCE OF CLASSIFIERS USED TO DETECT EMOTIONS IN VALENCE-AROUSAL SPACE

Classifier   Avg Acc (%)   Avg Precision (%)   Avg Recall (%)   Avg Specificity (%)   Avg F1 Score (%)   Avg FPR (%)
3NN          72.74         70.93               77.21            90.89                 72.79              9.11
5NN          57.26         54.43               57.91            85.75                 53.77              14.06
7NN          41.56         52.82               62.72            86.91                 45.47              13.09
SVM          39.3          41.09               47.41            84.55                 31.92              15.45
RF           71.61         69.91               73.81            90.36                 70.24              9.46
MLR          85.04         85.83               88.23            94.68                 86.2               5.32
SVM + RF     35.91         34.11               47.39            84.4                  32.53              15.6

However, when dealing with the intricacies of a three-dimensional VAD emotional space including the neutral emotion, there is a notable decrease in mean accuracy to around 83% (see Table VIII). Despite this reduction, the performance remains above the benchmark by 14%, underscoring the efficacy of the classifiers in complex emotion recognition tasks. The RF and MLR classifiers are the most suitable classifiers for detecting emotions in the 3-D VAD space, whereas the KNN classifier is not suitable in this scenario. Similar to the PNN emotion model, the performance of KNN improves slowly as the number of neighbors increases. The performance of the ensemble classifier is similar to that of the same classifier used to detect the PNN emotional states, and the performance of SVM improves slightly compared to that of the same classifier used to detect emotions in the valence-arousal space.

TABLE VIII
THE PERFORMANCE OF CLASSIFIERS USED TO DETECT EMOTIONS IN 3-D VAD SPACE

Classifier   Avg Acc (%)   Avg Precision (%)   Avg Recall (%)   Avg Specificity (%)   Avg F1 Score (%)   Avg FPR (%)
3NN          45.65         36.62               44.17            91.27                 32.3               8.73
5NN          47.09         36.76               46.92            91                    33.76              8.99
7NN          53.35         39.38               48.69            90.37                 40.33              9.63
SVM          52.65         52.65               52.65            89.49                 52.65              10.51
RF           71.65         71.65               68.08            94.37                 50.39              5.63
MLR          83.43         68.43               83.43            97.48                 75.19              2.52
SVM + RF     58.39         58.39               58.39            93.07                 58.39              6.93

V. CONCLUSIONS

Emotions play a significant role in human decision-making. Human affect can be expressed subjectively, objectively, or both. Subjective emotion can be expressed verbally through text, or non-verbally through voice, facial expressions, and body gestures; emotion can be expressed objectively via physiological signals. Accurately detecting human emotion from multimodal data using ML has been actively studied in the literature. Existing studies focus on proposing or integrating new techniques or frameworks to improve the performance of existing systems, and most of these frameworks are built with MATLAB. The relationship between the way emotion is defined and the performance of ML classifiers on multiple physiological signals remains unclear.

In this study, we defined emotional states in four different ways according to the VAD values provided in the DREAMER dataset: a binary emotion, a PNN emotion, a two-dimensional valence-arousal emotion space, and a three-dimensional VAD emotion space. We implemented a fundamental ML pipeline to detect human emotions with EEG and ECG signals. The pipeline supports signal preprocessing, detecting and removing outliers, balancing datasets, normalizing features, and transforming classified emotions, and it is built with the PyCM, NeuroKit, and Imblearn packages.

The DREAMER dataset is employed in our study; it is imbalanced and contains outliers. The performance of the original DREAMER experiments is used as the benchmark for our ML pipeline. The experimental results of our pipeline without outlier removal and data balancing indicate that the best classifiers for this system are the ensemble classifier (SVM + RF) and the RF, and the worst classifier is Gaussian Naïve Bayes. The SVM and ensemble classifiers are the most unsuitable classifiers for non-binary emotion models; their accuracy is around 50%. They
only perform well in the binary emotion model, where all metrics measure above 90%. The RF and MLR classifiers consistently achieve good performance for both binary and non-binary emotion models: their accuracy is around 85% for non-binary emotion models and 91% for binary emotion models. The KNN, SVM, and ensemble classifiers are suited only to detecting binary emotions. The performance of the KNN and SVM classifiers gradually declines as the dimension of the emotional state increases. For the PNN model and the 3-D emotion model, the performance of these two classifiers improves slowly; conversely, their performance declines to around 39% for the valence-arousal emotion model. The ensemble classifier sustains its performance at around 59% for the PNN emotion model, and for the 3-D emotion model its performance descends to 35%.
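The evaluation protocol compared in this study can be reproduced in outline with scikit-learn. The sketch below is illustrative only, not the authors' code: the feature matrix is synthetic rather than extracted EEG/ECG features, the RF settings are placeholders (the text does not specify which RF parameter was set to 5), and the repeat count is reduced from 1000 to keep the example fast.

```python
# Illustrative sketch of the ensemble evaluation: an SVM (linear kernel,
# C = 0.001) and a random forest combined in a soft-voting ensemble,
# scored with repeated stratified 10-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the extracted EEG/ECG feature matrix and labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="linear", C=0.001, probability=True)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",  # average the predicted class probabilities
)

# The study repeats 10-fold CV 1000 times; two repeats suffice here.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
scores = cross_val_score(ensemble, X, y, cv=cv, scoring="accuracy")
print(f"mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```

Soft voting requires per-class probabilities, which is why the SVC is constructed with `probability=True`; hard voting would tally predicted labels instead.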
The prerequisite of this pipeline is that all modalities are complete and fully available for detecting human emotions. In real-world situations this is not guaranteed, so a further study needs to be conducted to address the issue of detecting emotions when some modalities are absent.
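The pipeline stages summarized in this section (outlier removal, dataset balancing, feature normalization) can be sketched as follows. This is a minimal illustration under assumptions: the feature matrix is synthetic, the 3-standard-deviation outlier threshold is an assumed value, and plain NumPy random oversampling stands in for the Imblearn balancing utilities the actual pipeline is built on.

```python
# Illustrative sketch of the pipeline stages: outlier removal, dataset
# balancing, and feature normalisation. Random oversampling is a stand-in
# for Imblearn; the z-score threshold of 3 is an assumed value.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

# 1) Outlier removal: drop samples with any feature beyond 3 std devs.
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
keep = (z < 3).all(axis=1)
X, y = X[keep], y[keep]

# 2) Balancing: randomly oversample each minority class up to the size of
#    the majority class (Imblearn's SMOTE would synthesise samples instead).
classes, counts = np.unique(y, return_counts=True)
target = counts.max()
idx_parts = []
for c, n in zip(classes, counts):
    idx = np.flatnonzero(y == c)
    idx_parts.append(np.concatenate([idx, rng.choice(idx, size=target - n, replace=True)]))
sel = np.concatenate(idx_parts)
X_bal, y_bal = X[sel], y[sel]

# 3) Normalisation: zero-mean, unit-variance features.
X_norm = StandardScaler().fit_transform(X_bal)
```

Oversampling before normalization keeps the scaler's statistics consistent with the balanced training set; in a cross-validated setting both steps would be fitted on training folds only.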
VI. ACKNOWLEDGMENT

We express our deepest gratitude and appreciation for the assistance and support received throughout the completion of this research paper. We are grateful for the invaluable support provided by Munster Technological University throughout the research process.