2024 Springer
https://doi.org/10.1007/s00521-024-09426-2
REVIEW
Abstract
Emotion is an interdisciplinary research field investigated by many research areas such as psychology, philosophy, computing, and others. Emotions influence how we make decisions, plan, reason, and deal with various aspects of life. Automated human emotion recognition (AHER) is a critical research topic in computer science. It can be applied in many applications such as marketing, human–robot interaction, electronic games, e-learning, and many more, and it is essential for any application that needs to know the emotional state of a person and act accordingly. Automated methods for recognizing emotions use many modalities, such as facial expressions, written text, speech, and various biosignals such as the electroencephalogram, blood volume pulse, and electrocardiogram. These signals can be used individually (uni-modal) or as a combination of more than one modality (multi-modal). Most of the work presented so far consists of laboratory experiments and personalized models; recent research is concerned with in-the-wild experiments and with creating generic models. This study presents a comprehensive review and evaluation of the state-of-the-art methods for AHER employing machine learning from a computer science perspective, along with directions for future research work.
Keywords Emotion recognition analysis · Physical signals · Intrusive and non-intrusive emotion recognition · Physiological signals · Facial expressions · Speech stimuli · Body postures and gestures · Machine learning and deep learning techniques
that relates to, arises from, or influences emotions", or, in another way, "any form of computing that has something to do with emotions". The correct automatic identification of emotions is the cornerstone of affective computing, and it is the subject of this study.

Emotion detection is, at its foundation, an automatic classifier that can sort human emotions into different categories [50, 105]. The process of creating an automatic classifier is as follows: gathering data, identifying the features that are relevant to the goal, and then training the model to detect and classify specific patterns [59, 69]. The generated model is used afterward to categorize new data. For example, to build a model that can detect happiness and sadness from facial expressions, researchers need to feed it photos of people smiling and photos of people frowning, labeled as "happy" and "sad". These images are used to build the classifier. Following that, when the classifier obtains an image of a person smiling, it recognizes the corresponding emotion [59, 69]. Building a model in real life is not that simple. Not only is there a lot of data to gather for training and evaluation, but there is also an effort of interpretation to be made, as we will see later. In addition, humans express their emotions in various ways, including facial expressions, voice and speech, body gestures, movements, writing, and others. Even our bodies respond with visible physical reactions to emotions (breath and heart rate, pupil size, and so on). Recently, it has also been shown that the environment can affect physiological body reactions and emotions [27, 59, 75, 112, 174].

Emotion detection technology has evolved especially in the business sector due to the massive potential of knowing and predicting how the consumer is feeling. This review differs from others by presenting recent publications that define and assess various modalities of emotions. We focus on research that attempted to connect empirically assessed components of emotional experience to identifiable emotional states. This review analyzes and evaluates the techniques used in these studies and summarizes their findings. It also gives an overview of each type of emotional information source in the following sections, examining both uni-modal (single-assessment) and multi-modal methodologies in order to cover the various ways to measure or recognize emotions.

We organize the rest of the review as follows: the second section describes emotion analysis, the evolution of emotion research, and emotion models. The third section presents various machine learning algorithms used in emotion recognition. The fourth section examines emotion detection and analysis using various inputs and models; emotions derived from speech and physiological states, emotions derived from text, facial expressions, body gestures, and combined environmental and physiological factors are all covered in that section. The fifth section expands on the preceding sections' findings, emphasizing the strengths and weaknesses of the reviewed studies. The last section presents the conclusions and recommendations for further research.

2 Emotion analysis

Humans have many ways to express their feelings. They may express themselves through writing, voice tone, facial expressions, physiological reactions, body gestures and postures, and physiological signals. Many emotion models can be used to categorize these emotions. A suitable emotion model needs to be adopted to recognize and interpret emotions from any modality. It should provide a set of permissible emotions for a specific scenario [124, 169].

2.1 Evolution of emotion research

In 1872, Charles Darwin argued that humans and other animals convey emotions with similar expressions and behavior under similar situations, after conducting psychological studies of facial expressions recorded in various circumstances in both humans and animals, as described in Ali et al. [4]. The period and the events that occurred at the time altered his perception of feeling; in his view, emotions in humans and other animals took a long time to develop. He covered general principles of emotions and how humans and animals can express emotional states, the causes and effects of all possible emotions such as anxiety, grief, depression, despair, joy, love, and devotion, and the explanation of emotional states with images showing the expressions of specific emotions.

Some emotional expressions, according to Darwin, are universal for individuals all across the world. He also argued that animals of similar species, and humans, react similarly to the same circumstance. His research revealed that even in species that are not closely related, some emotions can have similar expressions. As noted in Ali et al. [4], some philosophical and spiritual categorizations of emotions existed before that.

Emotion research began as a sub-field of philosophical and psychological theories. Emotions and their expressions, according to Darwin, are likewise linked to biological causes. Later work described emotions as brain mechanisms that are outputs of neural system functions [4, 124, 169]. Figure 1 depicts the progression of emotion research in various fields of study. According to evolution theory, different human emotions developed at different stages of human life. Over time, psychologists, sociologists, neuroscientists, biologists, and researchers from many other domains defined,
Model | Emotions | Type | Structure
Ekman [46, 124] | Anger, disgust, fear, joy, sadness, surprise | Categorical | -
Shaver et al. [144], Peng et al. [124] | Anger, fear, joy, love, sadness, surprise | Categorical | Tree
Oatley and Johnson-Laird [144], Peng et al. [124] | Anger, anxiety, disgust, happiness, sadness | Categorical | -
Circumplex, Russell [133], Peng et al. [124] | Afraid, alarmed, angry, annoyed, aroused, astonished, at ease, bored, calm, content, delighted, depressed, distressed, droopy, excited, frustrated, glad, gloomy, happy, miserable, pleased, relaxed, sad, satisfied, serene, sleepy, tense, tired | Dimensional | Valence, arousal
Lovheim [99], Peng et al. [124] | Anger/rage, contempt/disgust, distress/anguish, enjoyment/joy, fear/terror, interest/excitement, shame/humiliation, surprise/startle | Dimensional | Cube
Fig. 5 Machine learning techniques explanation Liu and Lang [96]

(a) Linear discriminant analysis (LDA): LDA is a supervised method used in creating ML models. It is a technique for reducing the dimensionality of data and is utilized in many ML and pattern-categorization tasks. The purpose of LDA is to convert the features from a higher-dimensional space to a lower one to avoid dimensionality issues while simultaneously saving resources and reducing training costs. This type of reduction is employed in applications such as image recognition, marketing, and predictive analytics.
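As a concrete sketch of this idea, the following snippet (assuming scikit-learn and synthetic data in place of real emotion features) projects 20-dimensional features onto two LDA components:

```python
# Minimal sketch: LDA as a supervised dimensionality-reduction step.
# The dataset and the 3 toy "emotion" classes are illustrative assumptions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))       # 300 samples, 20 raw features
y = rng.integers(0, 3, size=300)     # 3 emotion classes (toy labels)

# LDA projects onto at most (n_classes - 1) discriminant axes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)  # shape: (300, 2)
print(X_reduced.shape)
```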
2. Nonlinear classification includes the following algorithms:

(a) Gradient boosted machine (GBM): GBM, with popular implementations such as XGBoost, is an ML technique that can be used for regression and classification in a variety of contexts. It combines many weak prediction models, the most common of them being decision trees, to form an improved prediction model [74, 126]. Gradient boosted trees are used when a single decision tree is a weak learner, and they typically outperform random forests [74, 103, 126]. A gradient-boosted trees model is built in the same step-by-step fashion as other boosting methods, with the extra feature of being able to optimize any loss function, as shown in Fig. 8 [177]. (A minimal code sketch of gradient boosting is given after this list.)

(b) Decision tree (DT): one of the most widely used predictive modeling approaches in practice. It is a supervised ML technique for building a decision tree from training data. A decision tree is a regression and classification prediction model (also known as a classification tree or a regression tree). It is a mapping from item observations to target value judgments. Leaves (also known as labels) indicate classifications, non-leaf nodes represent features, and branches represent feature combinations that lead to categories [87]. In other words, the input values, or samples, are split into two or more homogeneous sets (or sub-trees) based on the most significant differences. DT utilizes several algorithms to decide how to divide a node into two or more sub-nodes; creating sub-nodes increases the homogeneity of the resulting sub-nodes, as shown in Fig. 9 [106].

(c) Neural network (NN): Artificial neural networks (ANNs), commonly known as neural networks (NNs), are ML models inspired by the working mechanism of the biological neural networks in brains. NN is a supervised learning technique based on classification that can also be used for regression and clustering. Artificial neural networks, also called multilayer perceptrons (MLPs), are networks of interconnected units or nodes designed to resemble real brain neurons. Each link can pass a message to other neurons, similar to synapses in the brain. A signal is received by an artificial neuron, which processes it before sending it to the neurons to which it is connected. A nonlinear function is applied to a neuron's inputs and determines its output. The connections are called edges. The weights of neurons and edges change regularly as more knowledge is acquired; the weight affects the signal strength at a connection. Neurons may have a signaling threshold such that they transmit a signal only if the aggregate input exceeds the threshold. The most frequent way of grouping neurons is into layers. The system's inputs can be adjusted in a variety of ways. After crossing the layers numerous times, signals move from the first (input) layer to the last (output) layer [65, 181]. Figure 10 represents a node-based neural network.
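As referenced in the GBM item above, here is a minimal gradient-boosting sketch using scikit-learn; the synthetic dataset and hyperparameters (100 trees of depth 3, learning rate 0.1) are illustrative assumptions:

```python
# Minimal sketch: gradient boosting over decision trees (scikit-learn).
# The synthetic data stands in for any extracted emotion features.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each stage fits a shallow tree to the gradient of the loss,
# then adds it to the ensemble with a small learning rate.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbm.fit(X_tr, y_tr)
print("test accuracy:", gbm.score(X_te, y_te))
```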
sort it into the right suitable category. (ii) The method can be used for both regression and classification; however, classification is the most widely used. (iii) Because this technique is nonparametric, it makes no assumptions about the underlying data. (iv) It is also known as a lazy learner algorithm. Figure 13 depicts a flow diagram of the KNN algorithm [149].

Fig. 13 K-nearest neighbor architecture Srivastava [149]

(i) Bagging CART: The ensemble method of bootstrap aggregating, or bagging, increases the accuracy of unstable models by averaging a set of the same model fit to bootstrapped samples of the feature space. Consider the following scenario: the data are presented as a collection of p predictors $X_1, \ldots, X_p$ with a response vector $Y = (Y_1, \ldots, Y_n)$. We use some base procedure $\hat{g}(\cdot)$ to model the relationship between X and Y. A bagged model, $\hat{g}_{\mathrm{bag}}(\cdot)$, is a linear combination of several $\hat{g}(\cdot)$ fit to bootstrapped samples of X. Bagging acts as a smoothing operator for hard loss functions (consider a single split in a decision tree); smoothing decisions reduces variance in the model, ultimately improving the prediction error. Heuristically, the variance of the bagged estimator $\hat{g}_{\mathrm{bag}}(\cdot)$ should be equal to or smaller than the variance of the original estimator $\hat{g}(\cdot)$, and the reduction in variance is higher when the initial estimator is unstable [22, 23, 184]. To create a model with lower variance, this method "averages" the predictions from several different models after they have been fitted. Fitting several models takes the following steps: First, several bootstrap samples are created, each of which functions as a distinct (almost) independent dataset drawn from the true distribution. Then, for each of these samples, a weak learner is fitted, and the results are combined to build an ensemble model that has less variance than its parts. Just as the bootstrap samples are approximately independent and identically distributed (i.i.d.), the learned base models exhibit this property as well. Finally, by "averaging" the outputs of the weak learners, the variance is decreased without changing the expected result. In other words, bagging entails building an ensemble model that averages the outputs of these weak learners by fitting several base models to various bootstrap samples, as shown in Fig. 14 [174].
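To make the bagging procedure concrete, here is a minimal sketch using scikit-learn's BaggingClassifier over decision trees (bagging CART); the synthetic data and the choice of 50 estimators are illustrative assumptions:

```python
# Minimal sketch: bagging CART, i.e., averaging many decision trees
# fit to bootstrap samples. Data here is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each of the 50 trees sees a bootstrap resample of the training set;
# predictions are combined by majority vote, reducing variance.
bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0)
bag.fit(X, y)
print(bag.predict(X[:5]))
```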
Fig. 14 Explanation of bootstrap aggregating method (bagging) Younis et al. [174]
(j) Stacking method: Using a meta-classifier, the ensemble learning technique of stacking combines different classification models. The outputs (meta-features) of each classification model in the ensemble are utilized to fit the meta-classifier after each classification model has been trained individually using the whole training set. Either the predicted class labels or the ensemble probabilities can be utilized to train the meta-classifier. As shown in Fig. 15, stacking is also known as stacked generalization, which is an ensemble method. Regression, density estimation, distance learning, and classification have all been tackled effectively with stacking. First, stacking frequently uses heterogeneous weak learners (different learning methods are combined), whereas bagging and boosting generally use homogeneous weak learners. Second, while bagging and boosting use deterministic techniques to combine weak learners, stacking uses a meta-model to combine the underlying models [174].

Fig. 15 Flowchart of stacking classification ensemble Younis et al. [174]
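To make the stacking procedure concrete, here is a minimal sketch using scikit-learn's StackingClassifier; the particular base learners (an SVM and a decision tree) and the logistic-regression meta-classifier are illustrative assumptions:

```python
# Minimal sketch: stacking heterogeneous base learners under a
# logistic-regression meta-classifier (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Base models' out-of-fold predictions become the meta-features
# that the final estimator learns to combine.
stack = StackingClassifier(
    estimators=[("svm", SVC(probability=True)),
                ("tree", DecisionTreeClassifier())],
    final_estimator=LogisticRegression())
stack.fit(X, y)
print("training accuracy:", stack.score(X, y))
```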
3.2 Regression algorithms

Regression is a supervised learning approach in which the target variable must be numeric. It has the following variants:

1. Linear regression includes the following algorithms:

(a) Linear regression (LR): Simple LR has only one input variable and one output variable, whereas multiple linear regression has one output variable but many input variables. The goal of an LR algorithm is to identify a linear equation between the input and output variables. In linear regression, the relationship between the input and output variables is given by the following formulae:
- Simple linear regression: $y = b_0 + b_1 x$
- Multiple linear regression: $y = b_0 + b_1 x_1 + \cdots + b_n x_n$
Here, the x variables are the input features and y is the output variable; $b_0, b_1, \ldots, b_n$ represent the coefficients to be estimated by the linear regression algorithm.

(b) Stepwise regression (SR): a technique used for selecting the best features for multiple linear regression. There are three types of SR: backward elimination, forward selection, and bidirectional elimination.

(c) Ridge regression: In situations where the independent variables are highly correlated, ridge regression is a technique for estimating the coefficients of multiple-regression models. It has been applied in a variety of disciplines, including engineering, chemistry, and econometrics. It is often referred to as Tikhonov regularization, after Andrey Tikhonov, and it is a technique for regularizing ill-posed problems. It is very helpful in reducing the multicollinearity issue in linear regression, which frequently arises in models with many parameters. In return for a manageable degree of bias, the approach generally offers enhanced efficiency in parameter estimation problems.
(d) Lasso regression: Least absolute shrinkage and selection operator, often known as lasso or LASSO, is a regression analysis technique used in statistics and machine learning that performs both variable selection and regularization to improve the predictability and interpretability of the final statistical model. Lasso was initially developed for linear regression models.

(e) Elastic net regression: The two most widely used regularized linear regression methods, lasso and ridge, are combined to create the elastic net. Ridge employs an L2 penalty, while lasso employs an L1 penalty; in other words, elastic net linear regression regularizes regression models by applying both the lasso and ridge penalties. By taking their respective drawbacks into account, the strategy combines the lasso and ridge regression methods to enhance the regularization of statistical models. The lasso's shortcoming, namely that it selects only a small number of variables for high-dimensional data, is improved by the elastic net method, which allows the incorporation of "n" variables up until saturation. When the variables are grouped into highly correlated groups, lasso usually selects one variable from each group while completely ignoring the others. Where the dimensionality of the data exceeds the number of samples used, the elastic net technique is best suited.

(f) Principal component regression (PCR): a regression analysis tool based on principal component analysis. It is used to estimate the unknown regression coefficients in a standard linear regression model. Instead of explicitly regressing the dependent variable on the independent variables, PCR uses the principal components of the explanatory factors as regressors. Because only a subset of all the principal components is typically employed for regression, PCR is both a regularized procedure and a shrinkage estimator. One of the most common applications of PCR is to solve the multicollinearity problem, which occurs when two or more input variables are nearly collinear. By removing some of the low-variance principal components in the regression step, PCR can effectively cope with such situations. Furthermore, because PCR usually regresses on only a fraction of all the principal components, it can significantly reduce the number of parameters that characterize the underlying model, resulting in dimension reduction. This is especially beneficial in settings with high-dimensional data [104].

(g) Logistic regression (LogR): For a given collection of features (or inputs) X, the target variable (or output) y can take only discrete values. Contrary to popular assumption, logistic regression is a regression model: the model creates a regression model to forecast the likelihood that a particular data entry belongs to a specific category. LogR uses a sigmoid function to represent the data, just as LR assumes that the data follow a linear model, as in Eq. 2 and shown in Fig. 16 [182].

$g(x) = \frac{1}{1 + e^{-x}}$    (2)

Logistic regression can be categorized into the following types: 1. Binomial: there are just two potential values for the target variable, "0" or "1", which can indicate "win" vs. "loss", "pass" vs. "fail", "dead" vs. "alive", and so on. 2. Multinomial: the target variable can have three or more non-ordered types (i.e., the types have no quantitative importance), such as "disease A" vs. "disease B" vs. "disease C". 3. Ordinal: deals with ordered categories of target variables. A test score, for example, can be classified as "extremely poor", "bad", "good", or "very good", and each category can be given a score of 0, 1, 2, or 3 points.

2. Nonlinear regression: the algorithms in this category were described above under nonlinear classification (item 2 of the classification algorithms).
principal component analysis-based regression
analysis tool. It is used to predict unknown
regression coefficients in a standard linear 3.3 Unsupervised learning
regression model. Instead of explicitly regressing
the dependent variable on the independent vari- Unlike supervised learning of the following algorithms, it
able, PCR uses the principal components of the has only x values and no labels for the data points. This
explanatory factors as regressors. Because only a method is significant when grouping data points that have
subset of all the principle components is often similar qualities. Clustering is an example of unsupervised
employed for regression, PCR is both a regular- learning.
ized process and a shrinkage estimator. • Clustering in unsupervised learning consists of the
One of the most common applications of PCR following algorithms:
is to solve the multicollinearity problem, which
occurs when two or more input variables are – K-means: is one of the most basic and often used
nearly collinear. By removing part of the low- unsupervised ML techniques. K-means sorts similar
variance principal components in the regression data into groups or clusters. Data within a specific
step, PCR can effectively cope with such situ- cluster bear a higher degree of commonality among
ations. Furthermore, because PCR usually only observations within the cluster than it does with
regresses on a fraction of all the principal observations outside of the cluster. In other words,
components, it can significantly reduce the before assigning each data point to the cluster with
number of parameters that characterize the the fewest centroids, the K-means method calculates
underlying model, resulting in dimension reduc- k centroids. The algorithm aims to find and combine
tion. This is especially beneficial in situations objects into groups (K) [113].
with many dimensional data [104].
– Self-organizing map (SOM): The Kohonen SOM is an unsupervised ANN able to handle nonlinear problems; it can be used for exploratory data analysis, pattern recognition, and the assessment of variable relationships. It is often used to cluster high-dimensional data. It has only three levels:

The input layer: consisting of n-dimensional inputs.

The weight layer: weight vectors that are adapted and represent the network's processing units.

The Kohonen layer: a computational layer made up of processing units arranged in a 2D lattice-like pattern (or a 1D string-like structure). SOMs have the unique ability to map high-dimensional input features into spaces with fewer dimensions [92].

3.4 Deep learning algorithms

Deep learning is divided into supervised and unsupervised learning.

3.4.1 Supervised DL

Deep belief networks (DBNs) can be applied to learning tasks like building classification or regression models. The two processes for training a DBN are layer-by-layer training and fine-tuning. As indicated in Fig. 17 [108], the top two layers have no direction, while the layers above have direct links to lower layers. Figure 17 depicts that layer-by-layer training refers to the unsupervised training of each RBM, whereas fine-tuning refers to the use of error backpropagation methods to refine the parameters of the DBN after the unsupervised training.

• Deep neural networks (DNNs): feed-forward networks (FFNNs) in which data are transferred from the input to the output layer without traveling backward; the links between the layers go one way only, forward. The results are achieved via backpropagation, which employs supervised learning using datasets of specific information, depending on what we wish to learn [1, 156]. Figure 18 depicts a representation of the deep neural network method. Each layer is followed by a nonlinear function called the activation function, such as sigmoid, ReLU, or tanh [48].
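To make the feed-forward structure and activation functions concrete, here is a minimal DNN sketch in PyTorch; the layer sizes and the four-class output (e.g., four emotion categories) are illustrative assumptions:

```python
# Minimal sketch: a small feed-forward DNN in PyTorch using one of the
# activation functions named above (ReLU). Sizes are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),   # input layer -> hidden layer 1
    nn.Linear(64, 32), nn.ReLU(),   # hidden layer 2
    nn.Linear(32, 4))               # output layer: 4 emotion classes (toy)

x = torch.randn(8, 20)              # batch of 8 feature vectors
logits = model(x)                   # forward pass only; training would
print(logits.shape)                 # apply backpropagation on a loss
```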
Fig. 18 A deep neural network architecture Feng et al. [48]
Fig. 19 A recurrent neural network model Lakshmanna et al. [86]
The distribution of the data itself is part of a generative model, which also indicates the likelihood of an example. For instance, models that can assign a probability to a sequence of words are often generative (and considerably simpler than GANs) and can predict the next word in a sequence. A discriminative model ignores the question of whether a given instance is likely and just tells you how likely a label is to apply to the instance; examples include the k-nearest neighbors algorithm, logistic regression, support vector machines, decision tree learning, random forests, maximum-entropy Markov models, conditional random fields, etc.

With the development of deep learning, a new family of techniques called deep generative models (DGMs) Tomczak [157] has been established: a group of methods that train deep neural networks to simulate the distribution of training samples. In particular, these methods cover the Gaussian mixture model (and other types of mixture models), the hidden Markov model, probabilistic context-free grammars, Bayesian networks (e.g., Naive Bayes, autoregressive models), averaged one-dependence estimators, latent Dirichlet allocation, and Boltzmann machines (e.g., the restricted Boltzmann machine and deep belief network), but the most popular methods are variational autoencoders (VAE), generative adversarial networks (GAN), autoregressive models, flow-based methods, and diffusion models, in addition to numerous hybrid approaches. These techniques are compared and contrasted, explaining the premises behind each and how they are interrelated, while reviewing current state-of-the-art advances and applications [19, 157].

These models have their roots in the 1980s and aim to learn about data without supervision, potentially providing benefits for standard classification tasks; gathering training data for unsupervised learning is much easier and less expensive than collecting labeled data, and there is still a lot of such information available, indicating that generative models can be helpful for a wide range of applications. Thus, generative modeling has been applied in many tasks, including emotion recognition; image synthesis: super-resolution, text-to-image and image-to-image conversion, inpainting, attribute manipulation, pose estimation; video: synthesis and retargeting; audio: speech and music synthesis; text: summarization and translation; reinforcement learning; computer graphics: rendering, texture generation, character movement, liquid simulation; medical: drug synthesis, modality conversion; and out-of-distribution detection [19].

In comparison, discriminative models handle a simpler task than generative models; more modeling is required for generative models. While generative models attempt to represent how data are distributed throughout the space, discriminative models attempt to define boundaries in the data space. Figure 20 shows the discriminative and generative models of handwritten digits [14]. Note that, by constructing a line in the data space, the discriminative model aims to distinguish between handwritten 0s and 1s. If the line is drawn correctly, it can discriminate between 0s and 1s without ever having to represent the precise placement of the instances on either side of the line in the data space. The generative model, on the other hand, attempts to generate acceptable 1s and 0s by producing digits that closely resemble their actual counterparts in the data space; the distribution throughout the entire data space must be modeled.

Fig. 20 Discriminative and generative models of handwritten digits Bernardo et al. [14]

1. Autoencoder: The AE is a generative method that can be suitable for extracting features and reducing dimensionality, with the same number of input and output units. These input and output layers are connected by one or more hidden layers. An autoencoder neural network is a type of unsupervised learning method that uses backpropagation to set the target values equal to the inputs. In other words, it employs the formula $y^{(i)} = x^{(i)}$, where $y^{(i)}$ represents output nodes and $x^{(i)}$ represents input nodes. The autoencoder tries to learn the function $h_{W,b}(x) \approx x$. In other words, it is attempting to learn a close approximation to the identity function, resulting in an output $\hat{x}$ that looks like $x$. The identity function appears to be a very simple function to learn, but by imposing constraints on the network, such as restricting the number of hidden units, we can uncover intriguing data structures [12]. Figure 21 shows a brief architecture of an autoencoder. Finally, an autoencoder is composed of two parts, the encoder and the decoder. The function of the encoder includes compressing and encoding the
Fig. 22 Stacked autoencoder architecture Shastry et al. [143]
Fig. 23 Denoising autoencoder architecture Majtner et al. [107]
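To make the encoder/decoder structure of the autoencoder concrete, here is a minimal PyTorch sketch trained so that $h_{W,b}(x) \approx x$; the 784-dimensional input (e.g., a flattened 28 x 28 image), layer sizes, and single gradient step are illustrative assumptions:

```python
# Minimal sketch of an autoencoder: the network is trained so that the
# reconstruction x_hat approximates the input x. Sizes are illustrative.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())     # compress
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())  # reconstruct

x = torch.rand(16, 784)                   # e.g., flattened images
x_hat = decoder(encoder(x))               # reconstruction x^
loss = nn.functional.mse_loss(x_hat, x)   # target values equal inputs
loss.backward()                           # one backpropagation step
print(float(loss))
```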
penalize the generator and vice versa. The generator should be able to consistently produce exact copies from the input domain, so that the discriminator is unable to distinguish between the two and consistently forecasts "unsure" (e.g., 50% for real and fake). Essentially, this is an actor-critic model. It is crucial to remember that each model can completely eclipse the other: the generator will have trouble reading the gradient if the discriminator is too effective, since it will produce values that are too close to 0 or 1, and false negatives can result if the generator is too strong, since it will take advantage of the discriminator's flaws. The "skill level" of both neural networks must be equivalent, as determined by their respective learning rates [102].

The generator model: the generator creates a sample in the domain using an input random vector of fixed length, generated at random from a Gaussian distribution. After training, points in this multidimensional vector space will correspond to points in the problem domain, forming a compressed representation of the data distribution. A vector space made up of latent variables is referred to as a latent space. In the case of GANs, the generator assigns meaning to points in a predetermined latent space; points selected from the latent space can be given to the generator model as input and used to produce new and distinctive output examples. After training, the generator model is kept and used to generate new samples.

The discriminator model: the discriminator model predicts a binary class label of real or fake based on an input example (actual from the training dataset, or generated by the generator model). The discriminator is a typical, widely known classification paradigm. Since we are interested in a reliable generator, the discriminator is discarded after training.

Deep convolutional generative adversarial network, also known as DCGAN, was one of the first GAN models to use convolutional neural networks. This network produces an image with the desired shape after receiving 100 randomly chosen numbers from a uniform distribution as input. There are numerous convolutional, deconvolutional, and fully connected layers in the network; it employs numerous deconvolutional layers to translate the input noise into the desired output image. The network's training is stabilized via batch normalization. All layers in the generator employ ReLU activation, except the output layer, which uses tanh, and all layers in the discriminator use leaky ReLU. Mini-batch stochastic gradient descent was used to train this network, and the Adam optimizer with adjusted hyperparameters was employed to speed up training. The paper's findings were quite intriguing: the authors demonstrated how the generator's vector arithmetic features could be used to modify images in a chosen way [19, 54]. Figures 25 and 26 depict the structure of the discriminator and generator of a deep convolutional generative adversarial network. Also, conditional GANs are one of the most frequently used GAN variants; they are created by simply adding a conditional vector to the noise vector. Figure 27 depicts the structure of a conditional GAN.
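To ground the generator/discriminator interplay described above, the following is a minimal sketch of one GAN training step in PyTorch; the architectures, Adam learning rates, and the random stand-in for real data are illustrative assumptions rather than the configuration of any reviewed model:

```python
# Minimal sketch of one GAN training step: the discriminator D learns to
# separate real from generated samples, the generator G learns to fool D.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, 784)            # stand-in for a batch of real data
z = torch.randn(32, 100)              # latent vectors from a Gaussian

# 1) Discriminator step: real -> label 1, fake -> label 0.
fake = G(z).detach()
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# 2) Generator step: make D classify the fakes as real.
loss_g = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```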
Fig. 28 Flow-based models architecture Madani et al. [102], Bond-Taylor et al. [19]
| VAE | GAN | Diffusion models
Output generated | Unrealistic and blurry | Good, but discriminates only between "fake" and "real" | Good
Types of data | Variable data
Computational cost | Moderate | Moderate | Extremely high
denoising diffusion probabilistic model (DDPM), which was initially proposed by Ho et al. [62] after being initiated by Sohl-Dickstein et al. [148]. Later work also explored several additional methods, including stable diffusion and score-based models. The previously mentioned generative techniques are all fundamentally distinct from diffusion models. Diffusion models attempt, intuitively, to break down the sampling-based image-generation process into several small "denoising" steps, the idea being that the model can self-correct over these minor adjustments and progressively produce a high-quality sample. In certain instances, models like AlphaFold have previously employed this concept of representation refinement. But nothing comes at zero cost: this iterative process makes diffusion models slow at sampling, at least compared to GANs [100]. Several generation tasks, including image, speech, 3D shape, and graph synthesis, have already used diffusion models.

Forward diffusion and a parametrized reverse process are the two processes that form diffusion models. The diffusion method is illustrated by the following: "destroy a data distribution's structure methodically and gradually using an iterative forward diffusion approach. A highly adaptable and manageable generative model of the data is produced when we train a reverse diffusion process that restores structure to the data. With this method, we may quickly learn about, sample from, and assess probabilities in deep generative models" [148]. The authors in Sohl-Dickstein et al. [148] built a generative Markov chain that converts a simple known distribution (e.g., a Gaussian) into a target (data) distribution using a diffusion process, which means that the state of an entity/object at any point in the chain depends solely on the previous entity/object. We can now illustrate the two methods (forward and reverse diffusion) on a specific image (Fig. 29) as a reference, using the following scenario: "The original image's structure (distribution) is gradually destroyed by adding noise, and a neural network model is then used to reconstruct the image, i.e., to remove the noise at each step. By repeating this process often enough with high-quality data, the model finally learns to estimate the underlying (original) data distribution. The trained neural network can then be used to create a new image that is representative of the original training dataset, starting from pure noise."

(a) Forward diffusion:
1- The original image ($x_0$) is slowly corrupted iteratively (a Markov chain) by adding (scaled Gaussian) noise.
2- This process is done for some T time steps, i.e., up to $x_T$.
3- The image at timestep t is created by: $x_{t-1} + \epsilon_{t-1}\,(\text{noise}) \rightarrow x_t$.
4- No model is involved at this stage.
5- Due to the iterative addition of noise, at the end of the forward diffusion stage we are left with a (pure) noisy image $x_T$ that represents an "isotropic Gaussian". This is simply a mathematical way of stating that the distribution is a standard normal distribution whose variance is constant across all dimensions; the data distribution has been transformed into a Gaussian distribution.

(b) Backward/reverse diffusion:
1- We reverse the forward procedure at this point. Iteratively again (a Markov chain), the aim is to remove the noise that was added during the forward process. An artificial neural network model is used for this.
2- The model is tasked with the following: given a timestep t and the noisy image $x_t$, predict
the noise ($\epsilon'$) added to the image at step $t-1$.
3- $x_t \rightarrow \text{Model} \rightarrow \epsilon'$ (predicted noise). The noise that was added to $x_{t-1}$ during the forward pass is predicted (approximated) by the model.
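As a concrete sketch of the forward diffusion process just described, the following toy loop corrupts an image tensor with scaled Gaussian noise for T steps; the linear beta schedule and tensor shape are illustrative assumptions:

```python
# Minimal sketch of forward diffusion: iteratively corrupt x0 with
# scaled Gaussian noise for T steps (a Markov chain).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
x = torch.rand(1, 3, 32, 32) * 2 - 1       # x0, a toy "image" in [-1, 1]

for t in range(T):
    noise = torch.randn_like(x)
    # x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise
    x = torch.sqrt(1 - betas[t]) * x + torch.sqrt(betas[t]) * noise

# After T steps, x is approximately isotropic Gaussian noise; a neural
# network would then be trained to predict the noise added at each step.
print(x.mean().item(), x.std().item())
```

A real DDPM implementation would use the closed-form expression for sampling $x_t$ directly from $x_0$ rather than this explicit loop; the loop here simply mirrors the step-by-step description above.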
As previously mentioned, several design issues are presented by the enormous class of machine learning (ML) tasks known as natural image synthesis, which has numerous applications. Image super-resolution is one instance, in which a model is trained to convert an unrefined low-resolution image into an accurate high-resolution image (e.g., RAISR). There are many uses for super-resolution, from enhancing medical imaging systems to recovering old family portraits. Class-conditional image creation is another image synthesis task, in which a model is trained to produce a sample image given an input class label. The resulting generated sample images can be utilized to enhance the functionality of subsequent models for image segmentation, classification, and other tasks [63]. Deep generative models, including GANs and VAEs, typically handle such image synthesis issues. However, when trained to create high-quality samples from challenging, high-resolution datasets, each of these generative models has drawbacks: for instance, mode collapse and unstable training are common problems for GANs, while VAEs suffer from blurry results. As an alternative, diffusion models, which were first proposed in Sohl-Dickstein et al. [148], have recently come back into interest due to their training stability and their encouraging sample quality results for the creation of images and audio. As a result, compared to other types of deep generative models, they may present more advantageous trade-offs. Diffusion models corrupt training data by gradually introducing Gaussian noise, gradually removing details until the data are pure noise, and then train a neural network to reverse this corruption. Running this reversed corruption procedure produces a clean sample by gradually denoising pure noise into data. This synthesis process can be thought of as an optimization algorithm that generates likely samples by following the gradient of the data density [63].

SR3 (super-resolution via repeated refinement) and CDM (cascaded diffusion models), a model for class-conditioned synthesis, are two connected techniques that push the limits of image synthesis quality for diffusion models. The authors demonstrated that these diffusion models (SR3 and CDM) outperformed previous methods (GANs, VAEs) by scaling up diffusion models and using well-chosen data augmentation strategies. In particular, SR3 achieves robust image super-resolution results that outperform GANs in human assessments, and high-fidelity ImageNet samples produced by CDM outperform BigGAN-deep and VQ-VAE2 by a significant margin on the FID score and Classification Accuracy Score [63].

Thus, it has been shown that diffusion models beat VAEs and especially GANs. Both diffusion models and GANs have found wide usage in the fields of image, video, and voice generation with good results: generative adversarial networks (GANs) have been a research area of much focus in the last few years due to the quality of output they produce, while diffusion models have become increasingly popular as they provide training stability as well as quality results on image and audio generation [132]. Even though GANs provide the foundation for image synthesis in a wide range of models, they do have several drawbacks that researchers are actively attempting to solve:

• Disappearing gradients: The problem of disappearing gradients might cause the generator training to fail if the discriminator is too good.
• Mode collapse: A generator can learn to create only one output if it produces an unusually believable result. The discriminator's optimal plan is then to develop the habit of rejecting such output without exception. As Google's description continues, "But if the next generation of discriminator becomes trapped in a local minimum and doesn't find the best strategy, then it's too simple for the next generator iteration to find the ideal output for the current discriminator" [132].
• Failure to converge: GANs frequently experience this problem as well.

In light of these issues, OpenAI researchers have shown that diffusion models can achieve image sample quality superior to other generative models, but with some limitations [38]. The paper by Dhariwal and Nichol [38] reported that the researchers could achieve this on unconditional image synthesis by finding a better architecture through a series of ablations; for conditional image synthesis, the team improved sample quality with classifier guidance. The researchers added that they believe two variables contribute to the difference between diffusion models and GANs: "There has been an extensive exploration of the model architectures employed in current GAN work," and, as Dhariwal and Nichol [38] put it, "GANs can trade off variety for fidelity, resulting in high-quality samples but not covering the entire distribution".

Also, the Google AI diffusion model work introduced two connected approaches named super-resolution via
repeated refinements (SR3) and cascaded diffusion models (CDM). The authors proved that these approaches produced higher image-synthesis quality than GANs [38].

The DiffWave diffusion model generates high-fidelity audio for a variety of waveform creation tasks, including class-conditional generation, unconditional generation, and neural vocoding conditioned on the Mel spectrogram. Results demonstrated that, in the unconditional generation task, it greatly outperformed autoregressive and GAN-based waveform models in terms of audio quality and sample diversity, according to several automatic and human evaluations [38]. Finally, we summarize the differences between these four generative methods in Table 3 and Fig. 30.
Table 3 Comparison of the four generative methods (VAE, flow-based models, GAN, and diffusion models)

Goal:
- VAE: VAEs (variational autoencoders) maximize the evidence lower bound (ELBO), which implicitly maximizes the log-likelihood of the data.
- Flow-based models: A flow-based generative model is built via a series of invertible transformations. The loss function is simply the negative log-likelihood since, unlike the other two, the model explicitly learns the data distribution.
- GAN: GAN provides a smart solution to model the data generation, an unsupervised learning problem, as a supervised one. The discriminator model learns to distinguish the real data from the fake samples produced by the generator model. The two models are trained as if they are playing a minimax game.
- Diffusion models: Non-equilibrium thermodynamics serves as the basis for diffusion models. They learn to reverse the diffusion process to create desired data samples from the noise, after defining a Markov chain of diffusion steps that gradually introduces random noise to the data. Diffusion models are trained using a predefined technique, in contrast to VAE or flow models, and the latent variable has high dimensionality (the same as the original data).

Pros:
- VAE: Fast sampling rate; diverse sample generation.
- Flow-based models: Fast sampling rate; diverse sample generation.
- GAN: Fast sampling rate; high sample generation quality.
- Diffusion models: High sample generation quality; diverse sample generation.

Cons:
- VAE: Low sample generation quality.
- Flow-based models: Need specialized architectures; low sample generation quality.
- GAN: Unstable training; low sample generation diversity (mode collapse).
- Diffusion models: Low sampling rate.
There are many papers that used generative models to predict emotional labels. The authors in Zhao et al. [179] proposed the semisupervised generative adversarial network (SSGAN) for SER (speech emotion recognition), which aims to identify emotion states from speech signals and to capture underlying knowledge from both labeled and unlabeled data. The SSGAN is derived from a GAN, but its discriminator can both categorize input samples as real or fake and determine their emotional class if they are real. As a result, it is possible to learn the distribution of actual inputs in a way that encourages label information transfer across labeled and unlabeled data. This article suggested two advanced methods, the smoothed SSGAN (SSSGAN) and the virtual smoothed SSGAN (VSSSGAN), which use adversarial training (AT) and virtual adversarial training (VAT), respectively, to smooth the SSGAN's data distribution. Using labeled instances as inputs, the SSSGAN smooths the conditional label distribution; the VSSSGAN smooths the conditional label distribution without label information (using "virtual" labels) [179].

The results showed that the suggested strategies outperformed the latest methods. The distributional smoothness of the SSSGAN and VSSSGAN makes them more robust than the SSGAN in experimental settings with mismatched and semi-mismatched unlabeled training sets. Also, to assess the performance of the suggested approaches in intradomain and interdomain scenarios, several tests were run using the IEMOCAP dataset and three other publicly accessible corpora [179]. This dataset is composed of five dyadic sessions corresponding to almost 10 h of recording time. Male and female speakers interact in pairs during each session in both scripted and unscripted spoken communication scenarios. While respondents' emotions were evoked in hypothetical settings, performers were requested to convey the appropriate semantic and emotional content in the scripted scenarios. Three evaluators divided each session into turns and labeled each turn with an emotion (such as neutral, happy, sad, angry, surprised, fear, disgust, frustration, excitement, and others). Each turn is given an emotional label based on majority agreement. In line with previous research, only four emotion categories (i.e., neutral, sadness, happiness, and anger) were taken into consideration in the experiments; additionally, examples of excitement and happiness were combined. For these studies, a total of 5531 turns (1708 for neutral, 1084 for sadness, 1636 for happiness, and 1103 for anger) were used [179].

To evaluate the performance of semisupervised SER with 300, 600, 1200, and 2400 labeled examples, the training set is used to randomly choose labeled instances, with an equal number of examples for each category; the remaining instances in the training set are treated as unlabeled examples. Table 4 lists several comparison techniques that perform well on the IEMOCAP dataset, including two supervised approaches and four semisupervised learning methods. SVM and DNN are chosen as the baseline supervised techniques. The DNN has a framework similar to the SSGAN, but it does not have an unsupervised loss. The SVM, used in the INTERSPEECH 2009 emotion challenge, is a linear SVM trained using a relatively limited quantity of labeled data. Additionally, four semisupervised learning approaches (self-training and a denoising autoencoder (DAE), each combined with an SVM, the SSAE, and the semisupervised ladder autoencoder (SS-LAE)) are compared with the suggested methods. To ensure a fair comparison, the adopted decoder's structure and the suggested method's generator are compatible, and the validation processes of the compared techniques are also compatible. Table 4 displays the outcomes of the comparison methods.

Table 4 Averages of UARs [%] with standard deviations over ten different experimental runs with 300, 600, 1200, and 2400 labeled data. Several baseline supervised methods and semisupervised learning methods are selected for comparison [179]

Method \ # of labeled data | 300 | 600 | 1200 | 2400
Supervised methods:
DNN | 49.8 ± 1.1 | 51.6 ± 1.5 | 53.7 ± 0.6 | 55.4 ± 0.9
SVM | 48.5 ± 1.7 | 49.3 ± 1.2 | 51.2 ± 1.6 | 53.1 ± 1.1
Semisupervised methods:
Self-training + SVM | 49.1 ± 1.8 | 50.2 ± 1.3 | 52.9 ± 1.2 | 53.9 ± 1.9
DAE + SVM | 50.9 ± 1.5 | 52.1 ± 1.5 | 54.7 ± 2.1 | 55.8 ± 0.7
SSAE | 51.1 ± 1.1 | 52.4 ± 1.2 | 55.4 ± 2.1 | 56.4 ± 1.2
SS-LAE | 51.7 ± 0.7 | 52.8 ± 1.6 | 56.2 ± 1.5 | 56.9 ± 1.9
Proposed methods:
SSGAN | 51.6 ± 2.0 | 54.2 ± 1.5 | 56.7 ± 0.7 | 57.8 ± 1.6
SSSGAN | 51.9 ± 1.3 | 55.3 ± 0.9 | 57.8 ± 2.0 | 59.3 ± 1.3
VSSSGAN | 52.3 ± 2.1 | 55.4 ± 1.7 | 57.1 ± 1.5 | 58.7 ± 0.9

The experimental results in Table 4 demonstrate that, in terms of the average UAR, the suggested methods outperformed the two supervised methods and the four semisupervised learning methods with various amounts of labeled data. The SSSGAN and VSSSGAN significantly outperformed the alternative techniques at p < 0.05. Given 2400 labeled data points, the SSGAN performs as well as the SSSGAN and VSSSGAN. However, when AT and VAT are used to smooth the conditional label distribution's output, the authors of Zhao et al. [179] observed improvements of 1.5% and 0.9%, respectively. One explanation is that VAT and AT can investigate the input's adversarial
orientation, enhancing the robustness of the suggested approaches.

Also, they examined the effects of the number of labeled data and compared the performance of the offered methods to modern techniques. The effectiveness of the suggested procedures with 300, 600, 1200, and 2400 labeled data points is shown in Fig. 31. Figure 31 illustrates how the performance of the various approaches grows with the amount of labeled data; notably, performance grows gradually as the amount of labeled data doubles. These findings imply that more labeled data is not necessarily advantageous for the suggested strategies. Additionally, the VSSSGAN achieves a 1.2% improvement in the UAR with 600 labeled data points when compared to the SSGAN, while the relative improvement is, respectively, 0.7%, 0.4%, and 0.9% for the 300, 1200, and 2400 labeled data points. These findings imply that the quantity of labeled data affects how much performance is improved. Additionally, when fewer labeled data are available, the VSSSGAN performs better than the SSSGAN; in contrast, the SSSGAN outperforms the VSSSGAN if more labeled data are available. This finding suggests that adding more labeled data may aid in smoothing the adversarial direction of the model [179].

Fig. 31 Performance of the proposed methods with 300, 600, 1200, and 2400 labeled data points in terms of the UAR (%) Zhao et al. [179]

Using the generative adversarial network (GAN) in multiple-discriminator settings and joint minimization of the losses provided by each attribute-specific discriminator model (knowledge and emotion discriminators), the authors in Varshney et al. [162] presented a technique called EmoKbGAN for automatic response generation. The model could produce sentences that flow naturally with better control over emotion and content quality, according to experimental results on two benchmark datasets, the Topical Chat and Document Grounded Conversation datasets; these results showed that the proposed method significantly outperformed baseline models in terms of both automated and human evaluation metrics. In other words, this research introduces EmoKbGAN, a unique knowledge-grounded neural network conversation model that uses both the underlying knowledge base and emotion labels to produce more in-depth and interesting responses. Expanding on the framework provided by Varshney et al. [162], the MLE objective used to supervise the training process is proposed to be replaced by multi-attribute discriminator training. Specifically, this approach primarily uses two distinct models: a transformer-based language model, which aims to produce relevant responses with the support of attribute features provided as input to the model, and the two discriminators, which direct the generation process by calculating the likelihood that sampled sentences will satisfy the given constraints.

The authors in Varshney et al. [162] evaluated the proposed model on the knowledge-grounded Topical Chat dataset with around 11K human–human conversations. Each conversation's words are based on one of eight major categories: fashion, politics, books, sports, popular culture, music, science & technology, and movies. Given that the annotators just used their common-sense knowledge when writing the utterances in the dataset, some of them may not have any knowledge attached to them. The emotions that each phrase in a dialogue conveys are noted, such as anger, disgust, fear, sadness, happiness, surprise, curiosity to dive deeper, and neutrality. Five separate sets of data (Train, Valid Frequent, Valid Rare, Test Frequent, and Test Rare) have been created. Conversations about entities that are commonly found in the training set are included in the frequent set, while the conversations in the rare set are about entities that were only occasionally seen in the training set. The findings of the experiments were presented on the frequent dataset. Also, the authors performed experiments on the Document Grounded Conversations Dataset, where the statements are based on information about the cast, the plot, the introduction, the reviews, and a few scenes; the typical document contains 200 words or less. They classify the target utterances of the CMU-DoG dataset using a BERT-based emotion classifier that has been trained on the utterances of the Topical Chat dataset. 200 sentences from the test set were utilized to evaluate the performance of the used model, and they obtained an overall accuracy rating of 0.74 on the test set.

Researchers in Varshney et al. [162] also conducted ablation studies for the multi-source generator and attribute-specific discriminators to demonstrate the effectiveness of each EmoKbGAN module. The models are KbGAN: EmoKbG with only a knowledge discriminator;
EmoGAN: EmoKbG with only an emotion discriminator. well-trained experts with postgraduate exposure, they
EmoKbG: Incremental transformer with twin decoders. We evaluated the predicted responses using the following
compare the outcomes of primary decoding and secondary metrics: Fluency, Adequacy, Knowledge Relevance, and
decoding to illustrate the twin decoder’s effectiveness. Emotional Content. On a scale from 0 to 2, authors in
EmoKbGAN-SD and EmoKbGAN-PD are the model Varshney et al. [162] rated responses for fluency, suffi-
names for EmoKbGANs that lack secondary and primary ciency, and knowledge relevance, with a score of 0
decoders, respectively. The last three utterances and the denoting an incomplete or unfinished response, a score of 1
associated text-based information serve as our input. The denoting a satisfactory response, and a score of 2 denoting
hidden size is set at 512 for all models. They employed a an accurate response. On a scale of 0 to 1, where 0 denotes
three-layer bidirectional LSTM with dot product attention the incorrect emotion and 1, the proper emotion, they rated
for the Seq2Seq-based generator. The number of encoder the emotional content. They calculated the Fleiss’ kappa
and decoder layers for transformer-based models is set to 3. value to assess the level of agreement between two anno-
Eight attention heads and 2048 filters are used in multi- tators. They got ‘‘high agreement’’ with kappa scores of
head attention. For the utterances, knowledge, and gener- 0.80, 0.86, 0.81, and 0.72 for fluency, sufficiency, emo-
ated answers, they employed shared vocab and embed- tional content, and knowledge relevance, respectively.
dings. The number 512 is selected as the word embedding According to automatic evaluation results, ITDD,
dimension empirically. For around 200 epochs, the dis- EmpTransfo, and ECM showed that the models learn to
criminator and generator networks are alternately trained. decode lexically relevant replies with substantial diversity
They employed the ADAM optimizer for the generator, on both datasets and Figs. 32, 33 showed that the proposed
whose learning rate is set at 0.0001. model has stronger unigram and bigram diversities as
Also, the authors in Varshney et al. [162] used one of the compared to the baseline models. Due to a strong Div. (n =
most popular metrics for evaluating sequences like BLEU, 1) and Div. (n = 2) score, they saw substantially fewer
perplexity (PPL), and n-gram diversity (Div.) to automat- repetitive segments in the answer produced by the sug-
ically evaluate the quality of generated responses. Fig- gested EmoKbGAN model. Their findings are equivalent to
ures 32, 33 depict evaluation results using automatic and those of the baseline models in terms of BLEU score
human evaluation metrics for baselines, ablation, and the performance on the Topical Chat dataset. This may be
proposed model on Topical document and CMU-DOG explained by the way BLEU uses n-grams to match tokens
datasets. from the expected and target answers [162].
They also measured the quality of the generated text In some cases, the response may be factually and cul-
from a human perspective, they randomly sample 100 turally correct yet employ synonyms that do not precisely
conversations from each model, and with the help of ten reflect the real response. They suggested EmoKbGAN
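To make the n-gram diversity metric concrete: Div. (n) is simply the ratio of unique n-grams to total n-grams over the generated responses, so a higher value means less repetition. The following minimal Python sketch is our own illustration (the function name and toy data are not from Varshney et al. [162]):

def distinct_n(responses, n):
    # Ratio of unique n-grams to total n-grams across all generated responses;
    # higher values indicate less repetitive text. n = 1 gives Div. (n = 1)
    # over unigrams, n = 2 gives Div. (n = 2) over bigrams.
    total, unique = 0, set()
    for text in responses:
        tokens = text.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

replies = ["i like football", "i like music a lot", "tell me more about it"]
print(distinct_n(replies, 1), distinct_n(replies, 2))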
Fig. 32 Evaluation results using automatic and human evaluation metrics for baselines, ablation, and the proposed model on the Topical Chat Frequent dataset [162]

Fig. 33 Evaluation results using automatic and human evaluation metrics for baselines, ablation, and the proposed model on the CMU-DoG dataset Varshney et al. [162]
They showed that EmoKbGAN outperforms the baseline models on BLEU scores for CMU-DoG. In particular, EmoKbGAN significantly outperforms EmoKb-Seq2SeqGAN and EmoKb-TransformerGAN in terms of BLEU. This finding suggests that EmoKbGAN effectively integrates the context and the pertinent knowledge base, leading to more varied responses. When only the generator portion of the model is used, they saw an increase in PPL scores and a fall in BLEU scores, illustrating the potency of the attribute-specific discriminators in the design. Although the suggested EmoKbGAN performs similarly to KbGAN and EmoGAN on the distinct metric, it significantly outperforms them on the BLEU and PPL measures. This shows that the model provides more linguistically correct responses when the attribute-specific discriminators are used jointly. Compared to the full EmoKbGAN model, they also noticed a decline in the scores of the EmoKbGAN-SD and EmoKbGAN-PD models, which demonstrates the decoder's twin decoding function in action [162].

According to the human evaluation results, the models that integrate knowledge tend to produce replies that are more understandable than the models that do not. Figures 32 and 33 show that EmoKbGAN performs better than the other baseline models on both datasets in terms of fluency, sufficiency, emotion quality, and knowledge relevance. The improvement in the fluency and sufficiency scores compared to the baseline models demonstrates that the suggested model produces responses that are more relevant and fluent. According to the emotional content score, the elicited answers are more in line with the emotional sensitivity of the statements. The knowledge relevance score also appears to have improved, indicating a general improvement in extracting the relevant data from the linked knowledge base. As mentioned in the Baselines section, they also performed experiments using pre-trained language models such as GPT, DialoGPT, BERT, and BART, apart from ITDD, ECM, and EmpTransfo. On human evaluation, they observed that even though these models exhibit competitive performance, the suggested EmoKbGAN approach exceeds them by a significant margin [162].

However, when comparing EmoKbGAN with the ablation models, the authors in Varshney et al. [162] found that their model can appropriately consume knowledge and emotion while producing remarkably consistent answers. They noted that the performance is similar when the discriminators are used independently, but when the discriminators are joined, they saw convergence and a boost in the overall effectiveness of the suggested approach. Because the attribute-specific discriminators outperformed the KbGAN and EmoGAN models in terms of adequacy, emotional content, and knowledge relevance scores, they validated their use. They also pointed out that their recommended approach, EmoKbGAN, outperformed the EmoKbGAN-SD and EmoKbGAN-PD models, which shows how the decoder's twin decoding feature works.

Finally, we presented a comprehensive overview of the generative models, their types, and their advantages and disadvantages in terms of improving image quality, as well as their application in emotion recognition.
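As an illustration of the alternating adversarial training recipe reported above (a generator updated against attribute-specific discriminators with the Adam optimizer at a 0.0001 learning rate), the following PyTorch-style sketch shows the loop structure only; the stub modules, tensor shapes, and losses are placeholder assumptions of ours, not the EmoKbGAN implementation:

import torch
import torch.nn as nn

# Stub modules: a real system would use a Seq2Seq/transformer generator and
# text-based emotion and knowledge discriminators, as described above.
gen = nn.GRU(input_size=512, hidden_size=512, batch_first=True)
disc_emotion = nn.Linear(512, 1)    # attribute-specific discriminator (stub)
disc_knowledge = nn.Linear(512, 1)  # attribute-specific discriminator (stub)

opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)  # learning rate reported in [162]
opt_d = torch.optim.Adam(
    list(disc_emotion.parameters()) + list(disc_knowledge.parameters()), lr=1e-4
)
bce = nn.BCEWithLogitsLoss()

for epoch in range(200):  # the paper reports roughly 200 alternating epochs
    real = torch.randn(8, 10, 512)          # stand-in for encoded gold replies
    fake, _ = gen(torch.randn(8, 10, 512))  # stand-in for generated replies

    # Discriminator step: each attribute-specific discriminator learns to
    # separate real replies from generated ones.
    d_loss = sum(
        bce(d(real.mean(1)), torch.ones(8, 1))
        + bce(d(fake.detach().mean(1)), torch.zeros(8, 1))
        for d in (disc_emotion, disc_knowledge)
    )
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: update the generator to fool both discriminators jointly.
    fake, _ = gen(torch.randn(8, 10, 512))
    g_loss = sum(bce(d(fake.mean(1)), torch.ones(8, 1)) for d in (disc_emotion, disc_knowledge))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()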
Emotions have a critical role in our decision-making, planning, reasoning, and other mental processes. Advanced driver assistance systems (ADAS), for example, can recognize certain emotions: drivers who monitor their emotions while driving receive crucial input that helps them avoid accidents. The significance derives from the fact that aggressive driving on the road leads to traffic accidents.

Emotion identification can be done utilizing facial expressions, voice, and text, as well as biosignals such as the electroencephalograph (EEG), blood volume pulse (BVP), electrocardiogram (ECG), electromyogram (EMG), galvanic skin response (GSR), respiration (RSP), and combinations of more than one signal. This section presents a detailed overview of each of the state-of-the-art strategies for emotion recognition using the ML approaches indicated above [9].

Methods for recognizing emotions, in general, can be divided into two groups:

– One methodology is to use one modality (uni-modal) for recognizing human emotions, such as facial expressions [115], speech signals, written text, body gestures, posture, and so on, which are easy to collect and have been researched for years [145]. However, reliability cannot be guaranteed, as it is relatively easy for people to control physical signals such as facial expressions or speech, especially during social interactions. People may smile in a formal social setting even if they are experiencing negative emotions. Whereas signals such as the electroencephalogram (EEG), body temperature (T), electrocardiogram (ECG), electromyogram (EMG), galvanic skin response (GSR), respiration (RSP), and other internal data [56] are not easy to control, some of these signals are very intrusive; EEG, for example, interrupts normal activities. Other signals are non-intrusive, such as signals collected from smartwatches and wristbands.
– In the second category (multi-modal), researchers employed more than one modality of the above-mentioned signals to identify emotions. Recent wearable devices carry different types of embedded sensors capable of measuring many physiological signals simultaneously and in a non-intrusive way. This enabled the creation of multi-modal datasets and consequently multi-modal emotion recognition models [114, 145].

The most commonly used modalities are the following:

– Using facial expressions to predict emotions. Facial expressions are vital in the understanding of emotions and non-verbal communication. They are significant for everyday emotional communication [119]. They are also a feeling indicator, allowing a person to express his emotional condition [138]. People can instantly detect a person's emotional state from facial expressions. As a result, researchers have frequently employed information on facial expressions in automatic emotion identification systems [138]. Because of its considerable academic and commercial potential, FER is a hot topic in the computer vision and artificial intelligence disciplines. This category concentrates on research that mainly uses facial images, as visual expressions are one of the most important information channels in interpersonal communication [28]. The computer vision community considers detecting human emotion based on facial expressions a challenging issue due to numerous difficulties such as differences in face shape from person to person, difficulty in recognizing dynamic facial features, low image quality, and so on Said and Barr [135]. The main problem when using facial expressions for identifying emotions is that they are prone to masking: the person can hide or conceal his real emotions in his facial expressions. Face detection and emotion recognition were, and still are, research topics that need enhancement. Researchers have presented several ways to enhance solutions to this problem using deep learning approaches to advance the state of the art and push the boundaries of traditional handcrafted techniques [135]. To conclude, faces can be considered an intrusive method for emotion identification because the person must be facing the camera for the image of his face to be taken. Neural networks have achieved great success in recognizing emotions from facial expressions.
– Emotion detection from text. Nowadays, writings come in various formats, including social media posts, microblogs, news pieces, and more. With the development of Web 2.0, people are now able to express their opinions and feelings by writing. Researchers use the content of these postings for text mining and sentiment analysis. Sentiment analysis is the extraction of emotions from these messages, and it is a massive and challenging task. Academics from several domains are attempting to develop methods for more precise detection of human emotions from various sources, including text
[183]. Researchers have applied many word-based and sentence-based strategies, machine learning, natural language processing methods, and other means to obtain improved accuracy. Emotion analysis can be beneficial in a variety of situations. The Oxford Dictionary defines 'emotion' as "a powerful feeling arising from one's circumstances, mood, or interactions with others," while 'sentiment' is "a view or opinion being held or expressed." According to the Cambridge Dictionary, an emotion is a powerful sensation, such as love or rage, or strong feelings in general, whereas sentiment is a notion, opinion, or idea based on a feeling about a circumstance or a way of thinking about something [6]. Sentiments are either 'Positive,' 'Negative,' or 'Neutral.' Sentiment analysis extracts meaningful information from the text to determine the attitudes of people toward various things such as a product, service, or event; sentiment analysis is a type of emotion detection [6, 16]. Due to the real-time and pervasive nature of smartphones and social networking platforms, many people prefer to share their feelings, opinions, and other information using visual and textual methods. Most people still use text to communicate their ideas and feelings in their daily routine on social media. There are many challenges for sentiment analysis. In some cases, a single piece of text may include mixed emotions. Then there are ambiguous emotions and words in some documents. Some words have many meanings, and multiple phrases might refer to the same feeling. Some of the text is sarcastic or includes slang. Multilingual text, misspellings, acronyms, and grammatically incorrect sentences are all features of Internet texts. Emotion extraction from text is a hot research topic, and researchers from all across the world are interested in modifications, improvements, and new approaches to handling its challenges Alswaidan and Menai [6], Nandwani and Verma [120], Bharti et al. [16]. This approach is also a uni-modal technique. It can be implemented either by lexicon or by ML.
– Emotion recognition from gesture and posture. There has been a boost in interest in emotion recognition algorithms that utilize facial expressions, body posture, and gestures over the last decade. Emotion recognition methods based on facial expressions, body postures, and gestures depend on the same hypothesis [10] as EMG: body postures and gestures are also involved in the response to emotions [91, 139] and are suitable for recognizing basic emotions. It is commonly believed that using body language is just another way to convey the same fundamental emotions as those shown through facial expressions. Furthermore, people employ the same muscles to express emotions across cultures, according to Atanassov et al. [10].

4.2 Emotion recognition using physiological and speech signals

Physiological signals have many challenges. First, they are often collected from people while moving, which makes them prone to noise. Second, there are many variations among people in the measurements of these signals. Third, some of these signals are invasive; EEG signals, for example, require a person to wear a headset, which is impractical in real-life applications. An overview of the latest emotion recognition approaches using speech signals and different physiological signals is presented in Saxena et al. [141], Wang et al. [167], Ali et al. [4]. This category is usually used as a multi-modal approach, combining multiple signals.

The ability to perceive and interpret driver emotions while driving and perform appropriate actions is one of the primary priority areas listed by international research groups for advancing intelligent transportation systems [175]. However, recognizing the mental state of an individual and responding while driving is a challenging task that remains a scientific problem. One of the main difficulties is that emotion-related signal patterns can vary widely from person to person or from one setting to another. Furthermore, due to the difficulty in precisely defining emotions and their meanings, it is difficult to determine a perfect association between the classes (patterns) [44]. Using suitable sensors, however, a driver's emotion and reaction can be captured and measured. Most emotion identification researchers have concentrated on analyzing a particular sensor data type, such as audio (speech) or video (facial expression) data [73]. Many recent studies in the emotion recognition field have begun to incorporate different sensor data to construct a powerful emotion identification system. The main goal of combining many sensors is to simulate human thinking: humans always use a variety of modalities to portray emotions during interactions. Researchers classified human modalities into audiovisual (facial expression, voice, gesture, posture, etc.) and physiological (respiration, skin temperature, etc.) [159].
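Whatever the sensor, most of the classical pipelines surveyed below first reduce a raw physiological trace to simple per-window statistics before classification. A minimal NumPy sketch of this step (our own illustration; the 64 Hz sampling rate and 10-s windows are assumed values, not from any cited study):

import numpy as np

def window_features(signal, fs=64, win_s=10):
    # Split a 1-D physiological signal into fixed windows and extract the
    # statistics commonly used in the surveyed studies: mean, standard
    # deviation, and mean absolute first difference.
    win = fs * win_s
    feats = []
    for start in range(0, len(signal) - win + 1, win):
        w = signal[start:start + win]
        feats.append([w.mean(), w.std(), np.abs(np.diff(w)).mean()])
    return np.array(feats)

eda = np.random.rand(64 * 60)      # one minute of synthetic EDA samples
print(window_features(eda).shape)  # (6, 3): six 10-s windows, three features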
The general methods for recognizing a person's emotion are speech, facial expression, or gesture. The speech signal can reveal the emotional state of the speaker [159]. During activation of the sympathetic nervous system by feelings such as anger, fear, or joy, speech becomes loud and fast [97]. When a person feels sad, his parasympathetic nervous system is active, and his speech becomes slower. The problem with speech signals is similar to that of facial expressions: the possibility of concealing one's emotions by pretending the opposite.

Detecting a subject's physiological pattern, on the other hand, can provide information about emotions because when a participant is positively or adversely excited, the sympathetic nerves of the autonomic nervous system are activated [97]. Sympathetic activation increases blood pressure, boosts the respiration rate, and raises the heart rate [97]. The most common physiological signals used for emotion recognition include the following:

• Electromyography (EMG) This term refers to a muscle's activity or the frequency with which it is tense. When muscular cells are electrically or neurologically engaged, EMG detects the electrical potential created by these cells [109]. A stressed person is likely to have a lot of muscle tension. EMG can also distinguish between negative and positive emotions by measuring muscle activity.
• Electrodermal activity (EDA), or GSR Skin conductivity (SC) is a measure of the conductivity of the skin, which increases when it sweats. This signal is a good and sensitive indicator of stress and other stimuli and a tool for distinguishing between conflict-free and anger or fear scenarios. One issue is that external factors such as temperature can influence this signal; as a result, it requires reference measurements and calibration Stržinar et al. [152]. The skin conductance response (SCR) in an EDA signal occurs in response to a stimulus [165]. Figure 34 depicts the skin conductance response in an EDA signal [45].
Fig. 34 Ideal skin conductance response (SCR) in the EDA signal [45]
• Heart rate (HR), or ECG The sino-atrial node, which generates an electrical impulse, initiates an orderly progression of depolarization in each healthy heartbeat. This impulse travels into the heart muscle, causing the heart to contract. Electrical variations are associated with the buildup of action potentials moving along the heart muscle. Scientists can use electrodes to measure the electrical impulses generated by the heart on the skin's surface over time [125]. This kind of recording is known as an ECG. Innovative and resilient technology for collecting emotion-related physiological data over a long period has been proposed [134]. Such a system does not restrict users' behavior (it is non-invasive) and can extract accurate physiological data in real-world environments using wireless transmission technology. The Emotion Check [125] is a wearable gadget that can detect users' heart rates and manage their anxiety. In ECG technology, an electrocardiograph is used to capture the variations in the heart's electrical activity as they occur on the skin throughout each cardiac cycle. A physiological signal generated by the heart's contraction and recuperation is observed by an ECG. ECG data have a physiological foundation, are directly related to a person, and are regularly used to assess a person's psychological state [93].
• Electroencephalogram (EEG) The electroencephalography signal is the measurement of brain waves and the assessment of brain activity. Currents flow during synaptic excitations of the dendrites of numerous pyramidal neurons in the cerebral cortex, resulting in brain waves [77]. EEG signals can be measured using small, flat metal disks (electrodes) attached to the scalp. The varying frequency ranges of the five primary brain waves identify them. These frequency bands, from low to high, are referred to as follows in Karaca et al. [77]:
1. Delta (δ) waves have a frequency range of 0.5 to 4 Hz. They frequently appear during deep sleep and may also be present in the waking state.
2. Theta (θ) waves occur in a frequency range of 4 to 7.5 Hz. When they appear during slumber, they are related to enhanced learning, creativity, profound meditation, and unconscious access. Theta waves appear to be associated with arousal levels.
3. Alpha (α) waves lie within the range of 8–13 Hz. In general, the alpha wave appears as a round or
sinusoidal-shaped signal and is associated with relaxation and super learning.
4. Beta (β) waves occur between 14 and 26 Hz. They are related to active thinking, active attention, and problem-solving.
5. Gamma (γ) waves correspond to high frequencies above 30 Hz. They can be detected and used to identify the presence of specific brain problems.

Figure 35 illustrates the four typical brain waves with their usual amplitude levels.

Fig. 35 Four typical brain waves, from high to low frequencies [77]

4.3 Emotion recognition using information fusion and physiological measurements

Recently, smartphones and a variety of wearable devices, such as smartwatches and wristbands, have been equipped with various sensors to continually monitor human physiological signals (such as heart rate, movements, EDA, and body temperature), as well as data from the surrounding environment (e.g., noise, brightness, etc.). As a result, massive databases have sprung up in different categories of research, including healthcare and smart cities. This surge of on-body and environmental data is a good opportunity for healthcare research, necessitating the development of new tools and methodologies for dealing with enormous multidimensional datasets [76]. It has also boosted research in multi-modal emotion recognition.

The study Kanjo et al. [75] constructed a user-dependent emotion prediction model based on sensor data collected from participants walking around Nottingham city center with a smartphone and a wristband, incorporating physiological (HR, EDA, body temperature, motion) and environmental (UV, noise, air pressure) factors. The researchers used three methods to build this model; one of them was determining the relationship between on-body and environmental elements, for which they addressed various studies in this area to assess the link between on-body and environmental reactions. Noise, air pollution, traffic, and even congested areas can cause serious health problems, such as headaches, sleep problems, and heart disease. Regarding the impact of environmental and physiological factors on emotion recognition, the total accuracy (86%) of this study Kanjo et al. [75] is based on the combination of multi-modal classifiers (SVM, RF, and KNN).

They used a deep learning approach in Kanjo et al. [76] for emotion categorization through an iterative process of adding and removing many sensor signals from various modalities in a real-world investigation employing smartphones and wearable devices. It incorporated the local interactions of three sensor modalities (on-body, environmental, and location) into a global model that reflects signal dynamics and the temporal links correlating them. This method applied various learning algorithms to the raw sensor data, including a hybrid approach that integrated convolutional neural networks and long short-term memory recurrent neural networks (CNN-LSTM). The results revealed that deep learning approaches were effective in human emotion categorization (average accuracy 95% and F-measure 95%) and that hybrid models beat standard fully connected deep neural networks (average accuracy 73% and F-measure 73%) when using a wide range of sensors. The hybrid models also outperformed previously developed ensemble approaches that used feature engineering to train the model (average accuracy 83% and F-measure 82%) [21, 76].

By allowing robots to understand emotions and body movements and react accordingly, emotion recognition technology could improve human–machine interaction, enhancing the user experience.

Table 5 Ilyas et al. [70] is an example of fusing or combining more than one modality to achieve higher accuracy levels and depicts the accuracy levels of each methodology. It can be noticed that fusing multiple modalities improves the accuracy of the model. The model presented in this research [70] detects emotions (anger, disgust, happiness, fear, sadness, surprise, and neutral) using upper body movements (hand and head movements) and facial expressions. Tasks like mood and gesture recognition can be easily handled using face features and movement vectors once this correlation has been mapped. This method employs a deep CNN trained on benchmark datasets displaying diverse emotions and body movements. Features obtained through facial movements and body motion are fused to improve emotion recognition performance. They used a variety of fusion approaches (feature-level fusion, decision-level fusion) to combine multi-modal signals for non-verbal emotion recognition.
Table 5 Results of different evaluation metrics for each frame-based emotion recognition method [70]
Evaluation metric | Facial expressions | Upper body movement | Bimodal average fusion | Bimodal product fusion | Bimodal bilinear pooling

The algorithm achieved 76.8% emotion recognition accuracy using solely upper body movements, outperforming the 73.1% obtained using the FABO dataset. Furthermore, using the FABO dataset, multi-modal compact bilinear pooling with temporal information outperformed the state-of-the-art method with an accuracy of 94.41%.

Liisi Kööts researched the influence of weather on affective experience (the link between negative and positive emotions and weather variables like temperature, relative humidity, barometric pressure, and brightness) [82]. Similarly, other studies have looked at reactions and their links to wellbeing and physiological changes; however, only one of these has looked at merging physiological and wellbeing sensors with ecological sensors to forecast and model emotion [35, 57, 71, 75, 81, 84, 118, 123, 137]. The above-mentioned research encourages the inclusion of environmental measurements in emotion recognition models.

Information fusion (which includes merging multiple data sources to provide consistent and accurate information) has three levels:
(a) Data-level fusion (low-level) tries to combine various data components from many sensors to complement one another. During data collection, it is possible to incorporate other data sources, such as user self-reported emotions [31, 49, 70, 167].
(b) Feature-level fusion (intermediate-level) is used to pick the best set of characteristics for categorization during data analysis. Using feature-level fusion, the best combination of features, such as EMG, respiration, skin conductance, and ECG, has been obtained [57, 70, 167].
(c) The purpose of high-level data fusion (decision-level) is to improve decision-making by combining the outcomes of different methodologies. See Field et al. [49], Ilyas et al. [70] for more information on data fusion algorithms and applications in body sensor networks.

Information fusion has also been applied in Younis et al. [174]. This study built a user-independent predictive emotional model based on integrating/fusing various modalities associated with heterogeneous sensors (environmental and physiological) using ensemble learning methods (bagging, boosting, and stacking) with a series of ML algorithms (SVM, DT, NB, and RF) as base classifiers to classify five distinct emotional states ranging from very negative to very positive. The results proved that the stacking ensemble method achieved a higher accuracy level, 98.2%, than the other ensemble methods.

Also, the study Wang et al. [167] applied information fusion to predict emotional states. The authors created the Multi-modal Emotion Database with Four Modalities (MED4) as the first step in the multi-modal emotion database construction process. MED4 is a collection of synchronously recorded signals from participants' speech, facial pictures, photoplethysmography, and EEG as they responded to happy, sad, angry, and neutral emotion-inducing video stimuli. 32 volunteers participated in the study, which was conducted in an anechoic chamber and in a research lab with background noise. Four baseline algorithms were created to test the database and the effectiveness of AER approaches: identification vector plus probabilistic linear discriminant analysis (i-vector + PLDA), temporal convolutional network (TCN), extreme learning machine (ELM), and multi-layer perceptron network (MLP). Additionally, two fusion algorithms were developed to use both internal and external data on the human state at the feature level and the decision level, respectively. The findings demonstrated that EEG signals are more accurate in identifying emotions than speech signals (achieving 88.92% in an acoustically quiet environment and 89.70% in one with naturally occurring noise, vs. 64.67% and 58.92%, respectively). When speech and EEG signals are combined, fusion procedures can increase total emotion detection accuracy by 25.92% compared to speech alone and by 1.67% compared to EEG in acoustically quiet conditions, and by 31.74% and 0.96%, respectively, in naturally noisy conditions. Fusion techniques also improve AER's robustness in noisy environments.
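The feature-level and decision-level strategies in the list above can be sketched with standard scikit-learn idioms on synthetic two-modality data; this is our own illustration, not the pipeline of any cited study:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_ecg, X_eda = rng.normal(size=(100, 8)), rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)  # two emotion classes, synthetic labels

# (b) Feature-level fusion: concatenate per-modality feature vectors,
# then train a single classifier on the combined representation.
fused = SVC(probability=True).fit(np.hstack([X_ecg, X_eda]), y)

# (c) Decision-level fusion: train one classifier per modality and
# combine their outputs, here by averaging class probabilities.
clf_ecg = SVC(probability=True).fit(X_ecg, y)
clf_eda = RandomForestClassifier().fit(X_eda, y)
proba = (clf_ecg.predict_proba(X_ecg) + clf_eda.predict_proba(X_eda)) / 2
print(fused.score(np.hstack([X_ecg, X_eda]), y), (proba.argmax(1) == y).mean())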
Table 6 List of some on-body sensors that have been used for emotion detection
Sensor | Signals and features
Body temperature | Despite its simplicity, body temperature can be used to gauge a person's emotions and mood shifts Guendil et al. [57], Irrgang and Egermann [71], Adibuzzaman et al. [3]. Wan-Young Chung demonstrated that variations in skin temperature, known as temperature variability (TV), may be used to identify nervous system activity Chung et al. [30]
Heart rate | The RR interval refers to the period between 2 successive pulse peaks, and the signal produced by this sensor consists of heartbeats. Many emotion recognition studies use HR to measure happiness and emotions Wan-Hui et al. [166], Colomer Granero et al. [31], Adibuzzaman et al. [3]
EDA | Sometimes called galvanic skin resistance (GSR), it is associated with emotional and stress sensitivity Lisetti and Nasoz [94], Takahashi [154], Kim and André [81], Adibuzzaman et al. [3]
Motion | Because modern accelerometers incorporate tri-axial micro-electro-mechanical systems (MEMS) to record three-dimensional acceleration, the motion magnitude is computed as √(x² + y² + z²), the square root of the sum of the squared components. In recent years, the accelerometer has been utilized to identify emotions Irrgang and Egermann [71]
Table 6 is another example of information fusion: it depicts physiological health sensors used to monitor human health, combined with environmental sensors (UV, EnvNoise, and AirPressure) to predict emotional states.

4.4 Emotion elicitation methods

Many stimuli are used in the literature to elicit emotions. They can be classified as follows:

• Film clips With this technique, participants are shown an entire short film or selected parts. The fundamental benefit of this approach is that it provides access to a wide range of emotions, including love, rage, fear, joy, and others. On the other hand, its disadvantages include the necessity of separating the specific portions of interest from the shown film. Additionally, because emotions are transitory, delaying the evaluation of an emotion may result in labeling bias [83].
• Pictures This method works by displaying a sequence of pictures to participants. Its advantages are that it is easy to use and allows self-reporting. The method's drawbacks include the lack of standardization [4, 83].
• Music This method is implemented by playing music for participants. The advantages are that it is easy and simple, highly standardized, and the emotions develop over a time lapse (15–20 min). The drawbacks are that musical taste might influence the experienced emotions. Additionally, this method gives only positive or negative moods; it does not give discrete emotions [51, 71, 81, 170].
• Emotional behaviors as emotional stimuli In this situation, the goal is to modify the target person's feelings by influencing his/her interpretation of a behavior. This technique has the advantage of eliciting emotional responses from a wide range of sources (posture, eye gaze, tone of voice, breathing, and emotional actions). On the other hand, while certain occurrences are straightforward to control, others (such as making someone angry) are more complex [4].
• Dyadic interaction tasks Emotion is elicited through interaction between several classes of couples (friends, romantic partners, family members, etc.). This method has the advantage of presenting a wide range of emotional responses while also allowing researchers to investigate emotion in social contexts. Its disadvantages are as follows: (1) it needs much time and resource commitment; for example, dyadic interaction sessions might take 2–4 h to complete; (2) some sessions might be insufficient (the participant switches topics to avoid an emotional outburst); and (3) it shows emotion only by example Ali et al. [4].
• In-the-wild experiments The previously mentioned methods are called lab experiments. Recent research tends to use real-world 'in the wild' experiments. These methods rely on real experiments in real-life settings, such as people performing their daily activities like shopping [75, 174].

In general, choosing the elicitation scenario or the stimuli depends on the target emotions and the available sensors. The music and picture scenarios, for example, will not be beneficial if the researcher needs to extract the speech signal from a subject. Emotional behaviors as emotional stimuli and dyadic interaction tasks are both relevant scenarios in this case Ali et al. [4].

5 Summary of previous research on machine learning for emotion recognition

In this section, we discuss an overview of the authors' contributions and their findings for each modality mentioned above.
According to the facial expression recognition modality, some researchers used facial expressions as a uni-modal input to predict emotional states. The following results are examples of research using facial expressions for emotion identification.

The work presented in Tarnowski et al. [155] offered a method for identifying seven primary emotional states based on facial expressions: neutral, joy, surprise, anger, sadness, fear, and disgust. Because the face is the most visible area of the body, computer vision systems (often cameras) can analyze the image of the face to detect emotions. They employed Microsoft Kinect for 3D face modeling in this experiment due to its low cost and ease of use. Kinect has a low scanning resolution but a fast image registration rate (30 frames per second) [155]. It contains two cameras and an infrared emitter. Six participants between the ages of 26 and 50 took part in the study. Each experiment participant sat at a distance of 2 m from the Kinect device in a sitting position. A participant's task was to mimic facial expressions according to instructions on a computer screen. Researchers used photographs from the KDEF database [79] to create the instructions, which include the name of the emotional state and a picture of an actor performing the relevant expression [155]. This experiment produced emotion classification accuracies of 96% (3-NN) and 90% (MLP) for a random division of the data. For all users, the classification accuracy for the 'natural' partition of the data was 73% (for the MLP classifier). In the identical situation, the classification accuracy of the 3-NN classifier was 10% lower, which demonstrates that neural networks are capable of generalization [155].

To create an algorithm for real-time emotion recognition using virtual markers through an optical flow algorithm that works well in unstable conditions, the authors in Hassouneh et al. [61] used convolutional neural network (CNN) and long short-term memory (LSTM) classifiers to classify the emotional expressions of physically disabled people (deaf, dumb, and bedridden) and children with autism based on facial landmarks and electroencephalograph (EEG) signals. They employed ten virtual markers to gather data on six facial emotions (happiness, sadness, anger, fear, disgust, and surprise). Additionally, 55 college students with a mean age of 22.9 years (35 male and 25 female) freely participated in the experiment for facial emotion identification, and 19 undergraduate students offered their services to gather EEG signals. For the first phase of facial and eye detection, Haar-like features are employed. Based on a facial action coding system, the Lucas-Kanade optical flow method is then used to track the characteristics once virtual markers have been set at specific spots on the subject's face. The distance between each marker point and the center of the subject's face is used as a feature to classify facial expressions, and the characteristics for emotional classification using EEG signals are derived from the fourteen signals gathered from the EEG (EPOC+) channels. The features are then presented to the LSTM and CNN classifiers after being cross-validated five times. With CNN, they were able to detect emotions using facial landmarks with a maximum identification rate of 99.81%. However, for emotion identification using EEG signals, the maximum recognition rate obtained using the LSTM classifier is 87.25%.

Chowdary et al. [29] achieved an average emotion recognition accuracy of 96% on the CK+ database using SVM and CNN classifiers. Also, Umer et al. [158] used the CNN algorithm to predict emotion classes (happiness, sadness, fear, disgust, surprise, anger, and neutral) and achieved average accuracy levels of 77.8% on the KDEF dataset, 87.2% on the GENKI dataset, and 92.8% on the CK+ dataset.

Bargal et al. [13] introduced an emotion recognition algorithm for videos. First, they cropped photos to the required size, then transformed them to grayscale color space and applied histogram equalization. Second, they used three famous CNN models for training, utilizing the AFEW dataset Dhall et al. [37] and another dataset as additional training data. Third, the three CNN models' outputs were concatenated and encoded to create a set of feature vectors. Fourth, to classify emotions, Bargal et al. passed the feature vectors to an SVM classifier. The suggested method surpasses the state-of-the-art methods with an accuracy of 59.42% when tested on the AFEW dataset.

Dandil et al. [33] presented a convolutional neural network-based (CNN) face emotion classifier. Three convolution layers, one max-pooling layer after the first convolution layer, two average pooling layers after the second and third convolution layers, two fully connected layers, and a softmax layer make up the proposed CNN. The Viola-Jones face detection algorithm [163] was used to detect faces in images. They created a dataset comprising 3600 images to train and evaluate the suggested approach and used 240 images as testing data. The proposed method obtained a maximum accuracy of 72% in the evaluation.

According to the text emotion recognition modality, the following examples depict the results of previous papers that used written text to predict emotional states using ML and DL techniques.

The study Acheampong et al. [2] introduced a comprehensive overview of techniques used to identify emotional states in written texts. The authors used SVM, KNN, MLP, NB, and DT as base classifiers to classify emotional states (joy, happiness, sadness, fear, anger, surprise, disgust, neutral, fun, worry, love, hate, enthusiasm, boredom, relief, empty, and scared). They obtained the following results: KNN achieved an average accuracy of 83%, SVM gave an average accuracy of 77%, MLP gave an accuracy
of 77%, NB achieved an average accuracy of 74%, and DT achieved an average accuracy of 74%.

In Nandwani and Verma [120], the authors used NB, SVM, RF, and CNN as base classifiers to classify furious, cheerful, depressed, positive, negative, and neutral emotional states. They showed that naive Bayes achieved an F1 score above 90% in binary classification and an F1 score above 60% for three-class classification of sentiments. They also showed that random forest (RF), with an accuracy of 95.6%, performed better than the NB classifier, while the SVM classifier achieved an average accuracy of 85.47% and the CNN algorithm achieved 80% accuracy.

The study Bharti et al. [16] tried to overcome the limitations that sentiment analysis, including emotion detection, faces. To extract emotions from text, several approaches have been applied in the past using natural language processing (NLP) techniques, including the keyword approach, the lexicon-based approach, and the machine learning approach. However, due to their focus on semantic relations, keyword- and lexicon-based techniques have some drawbacks. To identify emotions in text, the authors of Bharti et al. [16] suggested a hybrid (machine learning + deep learning) model so as to improve the results. The deep learning techniques used include Bi-GRU and convolutional neural networks (CNNs); the machine learning approaches utilized are the support vector machine, random forest, naive Bayes, and decision tree. Sentences, tweets, and dialogues are the three types of datasets used to assess the performance of the hybrid approach. The ability to work with multi-text sentences, tweets, dialogues, keywords, and vocabulary words of easily detectable emotions is among the benefits they illustrated for the suggested approach. They obtained the following outcomes: among the ML classifiers, SVM provides the maximum accuracy of 78.97% compared to RF, NB, and DT. Using the DL approach, the CNN model has the highest F1 score (80.76%) and the Bi-GRU model has the highest accuracy (79.46%). The hybrid model, which combines CNN, Bi-GRU, and SVM, has an F1 score of 81.27%, a precision of 82.39%, a recall of 80.40%, and an accuracy of 80.11%.

Related to emotion recognition from body gestures, postures, and facial expressions, the study Mittal et al. [114] used CNN and LSTM classifiers to classify the emotional states anger, happy, neutral, sad, disgust, fear, and surprise based on the body gesture/posture and facial expression modalities. The authors in Mittal et al. [114] used two datasets: IEMOCAP, which contains four emotion labels (angry, happy, neutral, and sad), on which they achieved an average accuracy of 78.2% using LSTM and CNN, and the CMU-MOSEI dataset, which contains six emotion labels (angry, disgust, fear, happy, sad, and surprise), on which they achieved a mean classification accuracy of 85.0%.

The authors in Dzedzickis et al. [44] provided a summary of the primary relationships between body postures and emotions, as shown in Table 7. They used computer vision systems and analysis algorithms that can follow the motions of selected reference points to measure facial expressions, body posture, and gestures. In the realm of emotion recognition, such a measurement approach has advantages since it allows for non-contact or non-invasive measurements and delivers reliable results [44]. There are some limitations or drawbacks related to the presented methods, as with EMG: (i) only strong emotions that last a certain amount of time are recognized; weak emotions or extremely brief, non-intense stimuli do not result in visible facial movements or changes in body posture that can be detected; and (ii) when tracking body postures, it is hard to define the exact position of a reference point covered by clothes, so special markers for vision systems should be implemented in this case [44]. Despite the indicated drawbacks, facial expression, body posture, and gesture tracking are still promising tools in the emotion recognition domain. Tables 9 and 10 provide a summary of studies, including the analysis of facial expressions, body posture, and the implemented emotions [44]. It is evident that in the majority of research, facial expression, body posture, and gesture analysis methods were used together and complemented by other techniques to improve recognition accuracy (Table 9) [44]. When comparing methods for analyzing facial expressions, body position, and gestures with those previously discussed, it can be noted that these methods are among the most promising for future applications, particularly in practical applications that do not necessitate great precision and sensitivity, due to their broad applicability [44].

The study Ilyas et al. [70] used a combination of ML and DL methods (CNN, SVM, RNN, RNN-LSTM) as base classifiers to identify the emotional states happy, sadness, anger, and fear based on a combination of facial expressions and body gestures and postures. They produced the following results using CNN: 77.7% for facial expressions, 76.8% for upper body movement (hand and head movement) features, 85.7% for bimodal average fusion, 86.6% for bimodal product fusion, and 87.2% for bimodal bilinear pooling over all emotions.

Also, the authors in Raman et al. [130] used random forest (RF), logistic regression (LR), a gradient boosting classifier (GBR), and a ridge classifier (RC) to classify 12 emotional states (happy, angry, disagree, disgust, fear, hello, namaste, okay, sad, shock, surprise, and victorious).
Table 7 Relations between emotions and body posture and gestures Metri et al. [111], Lee et al. [88]
Emotions | Gestures and postures
Happiness | Body extended, shoulders up, arms lifted up or away from the body
Interest | Lateral hand and arm movement, and arm stretched out frontally
Surprise | Right/left hand going to the head; two hands covering the cheeks (self-touch); two hands covering the mouth; head shaking; body shifting backward
Boredom | Raising the chin (moving the head backward), collapsed body posture, head bent sideways, covering the face with two hands
Disgust | Shoulders forward, head downward and upper body collapsed, arms crossed in front of the chest, hands close to the body
Hot anger | Lifting the shoulders, opening and closing the hands, arms stretched out frontally, pointing, and shoulders squared
They proved that random forest achieved an average accuracy of 1.00 (100%), logistic regression achieved an average accuracy of 1.00, the gradient boosting classifier achieved an average accuracy of 0.96, and the ridge classifier achieved an average accuracy of 1.00.

According to the physiological signals modality, many papers have used physiological signals to predict emotional states, as shown in Table 11.

To provide an accurate approach to emotion recognition utilizing wearable technology, the authors in Domínguez-Jiménez et al. [43] proposed a model for the recognition of three emotions (amusement, sadness, and neutral) using physiological signals. With the help of video clips, 37 volunteers were asked to express the desired emotions while two biosignals were being monitored: the galvanic skin response and photoplethysmography, which measures heart rate. These signals were examined in the frequency and time domains to determine a collection of features. Several classifiers and feature selection strategies were assessed. The best model was created using support vector machines for classification and random forest recursive feature elimination for feature selection. The findings demonstrate that the neutral, amused, and sad emotions can all be identified using simple features of the galvanic skin response. The authors were able to recognize the three target emotions with an accuracy of up to 100% when evaluated on the test dataset.

With the aid of CNN-based classification of multi-spectral topological images acquired from EEG signals, the authors in Ozdemir et al. [122] suggested a novel method for estimating emotional states. By transforming EEG data into a series of multi-spectral topological images, as opposed to the majority of EEG-based techniques, which discard the spatial information of EEG signals, the temporal, spectral, and spatial information of the EEG signals is preserved. A series of three-channel topographical images is used to train the deep recurrent convolutional network to recognize significant representations. The test accuracies they were able to attain were 90.62% for negative and positive valence, 86.13% for high and low arousal, 88.48% for high and low dominance, and finally 86.23% for like-unlike.

People's individual EDA features and musical features were combined by the authors of Yin et al. [173], who then produced a network of residual temporal and channel attention. They demonstrated the efficiency of the proposed network for mining EDA features by applying a channel-temporal attention mechanism for EDA-based emotion identification to investigate dynamic and steady temporal and channel-wise data.

The goal of the study Romaniszyn-Kania et al. [131] is to develop a tool and propose a physiological dataset to complement psychological data. There were 41 students in the study group, ranging in age from 19 to 26. The research protocol was built on the acquisition of the electrodermal activity signal using the Empatica E4 device during three exercises carried out in a prototype Disc4Spine system, employing psychological research techniques. Various hierarchical and non-hierarchical data clustering and optimization techniques were examined in the context of the emotions experienced. The k-means classifier performed best during Exercise 3 (80.49%) and when the EDA signal was combined with negative emotions (80.48%). A comparison of the k-means classification with the independent division made by a psychologist again revealed the best results for negative emotions (78.05%).

Sepulveda et al. [142] provided a wavelet scattering algorithm to extract the characteristics of ECG signals from the AMIGOS database as inputs for various classifiers and evaluated their performance, reporting that accuracies of 88.8%, 90.2%, and 95.3% were obtained in the valence, arousal, and two-dimensional classifications, respectively, using the presented algorithm.

To distinguish between a driver's calmness and anxiety, Wang et al. [168] used ECG data such as time-frequency domain, waveform, and nonlinear properties along with their previously described model of emotion detection. For calm and anxiety, accuracy values of 91.34% and 92.89%, respectively, were attained.
Li [67] collected 140 ECG signal samples that were triggered by Self-Assessment Manikin emotion self-assessment experiments with the International Affective Picture System, and used a Wasserstein generative adversarial network with gradient penalty to add various numbers of samples to various classes. The outcomes demonstrated that increasing the amount of data improved all three classifiers' accuracy and weighted F1 scores.

Before processing the EEG data using a modified radial basis function neural network algorithm, Zhang et al. [176] first measured the EEG signals and extracted features from them. They then compared and discussed the experimental results provided by various classification models. The results demonstrated that the improved algorithm outperformed competing algorithms.

The EEG data were divided into three emotional states by Wagh and Vasanth [164], who also used the discrete wavelet transform to break the EEG signal up into its component frequency bands. To distinguish between various emotions, they also extracted temporal-domain characteristics from the EEG signal. The results showed that the highest frequency spectrum performed well in emotion recognition, with maximum classification rates of 71.52% and 60.19%, respectively, when the decision tree and k-nearest neighbor classification methods were utilized.

Priyasad et al. [128] proposed a novel method based on a deep neural network-based multi-task classifier to determine the dimensional emotional states (low/high) from unprocessed EEG signals. The proposed model exceeded state-of-the-art techniques by achieving accuracy levels of 88.24%, 88.80%, and 88.22% for arousal, valence, and dominance, respectively, using 10-fold cross-validation; 63.71%, 64.98%, and 61.81% with leave-one-subject-out cross-validation (LOSO) on the DREAMER dataset; and 69.72%, 69.43%, and 70.72% for a LOSO evaluation on the DEAP dataset.

A dataset comprising three classes of emotions and a total of 2100 EEG samples from two participants was used to test the long short-term memory model that Mohsen et al. [116] presented for the classification of positive, neutral, and negative emotions. The model had a testing accuracy of 98.13% and a macro average precision of 98.14%, according to the experimental findings.

In the study Doma and Pirouz [42], epoch data from EEG sensor channels are analyzed, and multiple machine learning techniques, including support vector machine (SVM), k-nearest neighbor, linear discriminant analysis, logistic regression, and decision trees, are compared. Each of these models is tested both with and without principal component analysis (PCA) for dimensionality reduction. Grid search was also used to reduce the execution time of each of the machine learning models that were evaluated over the Spark cluster by hyperparameter tuning. In this study, a multi-modal dataset for the examination of human affective states (the DEAP dataset) was utilized. The participants' labels for each of the 40 one-minute-long music clips served as the foundation for the predictions. Each clip was scored by participants based on its level of arousal, valence, like or dislike, dominance, and familiarity. For each of the four classes, a separate set of time-segmented, 15-s intervals of epoch data was used to train the binary class classifiers. The best segmentation result was achieved using PCA with SVM, which provided an F1 score of 84.73% with 98.01% recall in the 30th to 45th segmentation interval. Different classification models converge to higher accuracy and recall than others for each of the time segments and binary training classes. The findings demonstrate the need for several classification methods to categorize various emotional states.

Lee et al. [89] used a combination of physiological signals to achieve high performances of 80.18% and 75.86% for arousal and valence using deep learning autoencoders. In addition, Bizzego et al. [18] achieved accuracies of 0.93 (train) and 0.94 (test) using a DNN, and accuracies of 0.64 (train) and 0.61 (test) using an SVM, for basic emotion classes.

To classify emotions, scientists have used EMG, RSP, skin temperature (SKT), heart rate (HR), skin conductance (SKC), and blood volume pulse (BVP) as input signals. The features retrieved from the EMG are temporal and frequency parameters: the mean, standard deviation, mean of absolute values of the first and second differences (MAFD, MASD), distance, and so on are temporal parameters, while the mean and standard deviation of the spectral coherence function are the frequency parameters. This approach had an 85% recognition rate for various emotions Gouizi et al. [55].

ECG, EMG, and GSR were employed as signals in the reference work AlZoubi et al. [7] to categorize eight emotions. The 21 features recovered from the facial EMG and the other signals included the mean, median, standard deviation, maxima, minima, the first and second derivatives of the preprocessed signal, and their transformations. The mean, median, standard deviation, minimum, maximum, minimum rate, and maximum rate of the preprocessed signals were employed as features in the study Yang and Yang [172] to classify four emotions, with a recognition rate of 85% achieved by a support vector machine (SVM).

In the study of Xu et al. [171], the authors collected EMG, EDA, ECG, and other signals from 8 participants using the Biosignalsplux research kit, which is a wireless real-time biosignal acquisition unit with a series of physiological sensors. They employed SVM, naive Bayes (NB), KNN, and decision tree (DT) classifiers, with DT providing the best accuracy with the ST (skin temperature), EDA, and EMG
signals Xu et al. [171]. They achieved an average recog- classification accuracy (99%) possible. In comparison to
nition accuracy of over 81%. SVM and NB classifiers, ANN proved to be the most
The authors used ECG and GSR signals to distinguish accurate classifier, with an overall accuracy of 98%. The
among three emotions namely, happy, sad, and neutral overall accuracy for time domain features was 92.75% as
[34]. ECG and GSR were retrieved. The emotional clas- compared to entropy and frequency domain features.
sification achieved the following results: 93.32%, 91.42%, This study Hu et al. [68] used skin conductance and
and 90.12% using the SVM classifier to provide high subjective emotion evaluation of pleasure arousal domi-
accuracy for classifying all three emotional states nance to analyze the variations in people’s assessments of
respectively. tactile sensations for beech surfaces of varying shapes and
Hao et al. [60] used the CNN algorithm to predict roughness. They discovered that beech with arc forms
arousal, and valence emotion classes using visual-audio could help a participant preserve some of their mental
stimuli and achieves an accuracy of 81.36% and 78.42% in stability even under conditions of relatively high emotional
speaker-independent and speaker-dependent experiments, reactivity. When it came to how beech was perceived, men
respectively. exhibited a wider range of emotional arousal and a slower
According to physiological and speech stimuli modali- rate of emotional arousal than women.
ties, the following results are examples that used these Wu and Chang [170] conducted an experimental
modalities to identify emotional states. investigation on the effects of music on emotions using
This study Garg et al. [51] trained multiple ML algo- ECG. The findings indicated that the autonomic sympa-
rithms such as Lasso regression, elastic net regression, thetic nervous system was strengthened, repressed, and
ridge regression, kNN, SVR(RBF), SVR(poly), SVR(lin- remained unaffected by fast, moderate, and slow music,
ear), DT, RF, MLP, and AdaBoost using different datasets respectively. Additionally, they proposed the usage of
Kose et al. [83] investigated the potential of EOG and EMG signals for emotion recognition, classifying four types of emotions, namely happy, relaxed, angry, and sad. The authors provided an improved method for emotion recognition using horizontal electrooculogram, vertical electrooculogram, zygomaticus major electromyogram, and trapezius electromyogram signals; emotions were elicited by audio–visual songs. Time domain, frequency domain, and entropy-based features are extracted for the classification of emotions, and support vector machines, naive Bayes, and artificial neural networks are used to classify these features, with accuracy, average precision, and average recall used to compare each classifier's performance. The key achievement of Kose et al. [83] is the identification of time domain features as the optimal characteristics for EOG and EMG data: combining the ANN classifier with time domain features obtained the highest classification accuracy (99%). In comparison to the SVM and NB classifiers, ANN proved to be the most accurate classifier, with an overall accuracy of 98%, and the overall accuracy for time domain features was 92.75%, higher than for entropy and frequency domain features.
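A hedged sketch of this kind of window-based, feature-plus-classifier pipeline is shown below; the synthetic signal, window length, and feature set are illustrative assumptions, not the configuration of Kose et al. [83]:

```python
# Slice a biosignal into windows, compute simple time-domain statistics,
# then compare SVM, NB, and a small neural network on the features.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
signal = rng.normal(size=100_000)        # stand-in for an EOG/EMG recording
labels = rng.integers(0, 4, size=100)    # 4 emotion classes, one per window

def time_domain_features(window):
    """Mean, std, RMS, and zero-crossing count of one window."""
    zc = np.sum(window[:-1] * window[1:] < 0)
    return [window.mean(), window.std(), np.sqrt(np.mean(window ** 2)), zc]

windows = signal.reshape(100, 1000)      # 100 windows of 1000 samples each
X = np.array([time_domain_features(w) for w in windows])

for name, clf in [("SVM", SVC()), ("NB", GaussianNB()),
                  ("ANN", MLPClassifier(max_iter=500))]:
    print(name, cross_val_score(clf, X, labels, cv=5).mean())
```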
Hu et al. [68] used skin conductance and subjective emotion evaluation of pleasure-arousal-dominance to analyze the variations in people's assessments of tactile sensations for beech surfaces of varying shapes and roughness. They discovered that beech with arc forms could help a participant preserve some of their mental stability even under conditions of relatively high emotional reactivity. Regarding how beech was perceived, men exhibited a wider range of emotional arousal and a slower rate of emotional arousal than women.
Wu and Chang [170] conducted an experimental investigation on the effects of music on emotions using ECG. The findings indicated that the autonomic sympathetic nervous system was strengthened, repressed, and remained unaffected by fast, moderate, and slow music, respectively. Additionally, they proposed the usage of music as a stress reliever.
Andreu-Perez et al. [8] recorded films of players' faces while they were playing the video game "League of Legends" and used functional near-infrared spectroscopy to image the players' brain activity. This information was used to decode the players' skill level in a multi-modal framework, marking the first time this has been done using non-restrictive brain imaging technologies. The best tri-class classification precision, according to them, was 91.44%.
Several experiments on emotion recognition using voice and physiological cues have been undertaken in the past. The research began with subject-dependent techniques, where the emotion recognition system serves only one user and must be retrained or re-calibrated before being used for another. The emphasis is currently on subject-independent approaches, where the emotion identification system is generic (usable by any user). Table 8 gives a short review of previous works in emotion recognition using speech and physiological signals.
The table shows which signals were evaluated, which emotion-eliciting stimuli were used, which emotions were recognized, the number of people in each study, and which features were extracted and classification algorithms used, along with the accuracy of the methods. The maximum accuracy achieved in the case of subject-dependent techniques was 96.58% for recognizing three arousal levels, an accuracy of 95% was achieved for four emotions, and a 91.7% accuracy level was obtained for six emotions. Thus, the accuracy levels depend on the number of explicit emotions and the type of model.
Table 8 Previous work on emotion recognition using physiological and speech signals

Kim and André [81]. Signals: EMG, ECG, EDA, RSP. Features: statistical, energy, sub-band spectrum, entropy. Classifiers: linear discriminant analysis. Emotions: joy, anger, sadness, pleasure. Stimuli: music. Subjects: 3 (MIT database). Accuracy: 95% (subject-dependent), 70% (subject-independent).

Lisetti and Nasoz [94]. Signals: GSR, HR, ST. Features: no specific features stated. Classifiers: KNN, discriminant function analysis, Marquardt back-propagation. Emotions: sadness, anger, fear, surprise, frustration, amusement. Stimuli: movies. Subjects: 14. Accuracy: 91.7% (subject-dependent).

Haag et al. [58]. Signals: EMG, EDA, BVP, ECG, RSP. Features: running mean, running standard deviation, slope. Classifiers: NN. Emotions: arousal, valence. Stimuli: IAPS (International Affective Picture System). Subjects: 1. Accuracy: 96.58% arousal, 89.93% valence (subject-dependent).

Wan-Hui et al. [166]. Signals: ECG. Features: fast Fourier transform. Classifiers: tabu search. Emotions: joy, sadness. Stimuli: movies. Subjects: 154. Accuracy: 86% (subject-independent).

de Santos Sierra et al. [36]. Signals: EDA, HR. Features: no specific features stated. Classifiers: fuzzy logic. Emotions: stress. Stimuli: hyper-ventilation, talk preparation. Subjects: 80. Accuracy: 99.5% (subject-independent).

Maaoui and Pruski [101]. Signals: BVP, EMG, ST, EDA, RSP. Features: statistical features. Classifiers: SVM, Fisher LDA. Emotions: amusement, contentment, disgust, fear, sadness, neutral. Stimuli: IAPS. Subjects: 10. Accuracy: 90% and 92% (subject-dependent).

Kim [80]. Signals: EMG, EDA, ECG, BVP, ST, RSP, speech. Features: statistical features, BRV, zero-crossing, MFCCs. Classifiers: KNN. Emotions: arousal, valence. Stimuli: quiz dataset. Subjects: 3. Accuracy: 92% (subject-dependent), 55% (subject-independent).

Kulic and Croft [85]. Signals: EDA, HR, EMG. Features: no specific features stated. Classifiers: HMM. Emotions: arousal, valence. Stimuli: robot actions. Subjects: 36. Accuracy: 81% (subject-dependent), 66% (subject-independent).
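The difference between the subject-dependent and subject-independent figures in Table 8 lies in how the data are split for evaluation. A minimal sketch of the subject-independent protocol — synthetic features, with a leave-one-subject-out split so every participant is unseen at test time — could look as follows:

```python
# Subject-independent evaluation: hold out all data of one participant at
# a time (leave-one-subject-out). Features and labels are synthetic.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # physiological feature vectors
y = rng.integers(0, 2, size=200)          # binary emotion labels
subjects = np.repeat(np.arange(10), 20)   # participant id for each sample

logo = LeaveOneGroupOut()
scores = cross_val_score(SVC(), X, y, groups=subjects, cv=logo)
print(scores.mean())    # subject-independent performance estimate
```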
Subject-independent techniques, on the other hand, reached a maximum accuracy of 99.5% for recognizing one emotion (stress), 86% for two emotions, and 70% for detecting four emotional states. We can also notice for physiological signals that, besides the feature extraction and classification approaches, the type of emotion stimuli affects the accuracy of the model.
In general, the sensors used, the number of subjects, the emotional states, the stimuli used, and the feature extraction and classification methods are the required parameters to build a robust and reliable emotion recognition system [4].
Most of the work presented in this category is related to subject-dependent (personalized) models. Moreover, all the presented studies were conducted in the laboratory. It can also be noted that the highest performance of 99.5% accuracy was achieved by combining EDA and HR signals.
Regarding the information fusion and physiological signals modalities, there are many papers that used these modalities and achieved higher results, as shown in Sect. 4.3.
Table 9 depicts previous works that used facial expressions, body gestures and postures, and physiological signals, either as single modalities or in combination. Table 9 presents the aim of each previous research work, the emotions used in each study, the modalities used, and the experimental hardware devices used in each work to extract the features that identify emotional states.
Finally, we can summarize the ML algorithms most commonly used across the emotion recognition modalities detailed in Sects. 3 and 4, according to the results discussed in this section, in Tables 10 and 11. These tables depict previous works from 2020 to 2022 that used different ML classifiers, with various approaches or modalities such as facial expressions, text, body gestures and postures, physiological signals, and speech, or a combination of them, to distinguish the emotional labels used to predict emotional responses.
Table 9 Review of scientific research work focused on emotion recognition and evaluation by the analysis of facial expressions, body posture, and gestures

Subramanian et al. [153]. Aim: presentation of ASCERTAIN, a multimodal database for implicit personality and affect recognition using commercial physiological sensors. Emotions: high/low valence and arousal. Methods: GSR, EEG, ECG, HRV, facial expressions. Hardware and software: GSR sensor, ECG sensor, EEG sensor, webcam to record facial activity, Lucid Scribe software.

Gay et al. [52]. Aim: creation of a personalized tool for a child to learn and discuss her feelings and stress level. Emotions: real-time arousal. Methods: facial expression recognition. Hardware and software: smartphone camera, CaptureMyEmotion application.

Aychet et al. [11]. Aim: explore the limitations of automatic affect recognition applied in the usability context and propose a set of criteria to select input channels for affect recognition. Emotions: valence and arousal, interest, slight confusion, joy, sense of control. Methods: GSR, facial expressions. Hardware and software: Infiniti Physiology Suite software; standard internet camera and video capture software from Logitech; Noldus FaceReader; Morae GSR recorder.

Lee et al. [88]. Aim: present a novel method for computerized emotion perception based on posture to determine the emotional state of the user. Emotions: happiness, interest, boredom, disgust, hot anger. Methods: body postures. Hardware and software: C++ in Ubuntu 14.04; Kinect for Microsoft Xbox 360 and OpenNI SDK.

Sapiński et al. [140]. Aim: propose a novel method to recognize seven basic emotional states utilizing body movement. Emotions: happiness, sadness, surprise, fear, anger, disgust, and neutral state. Methods: gestures and body movements. Hardware and software: Kinect v2 sensor.
Note that, as shown in Table 11, for the physiological and environmental modality there is only one recent study, from the year 2022 (within the 2020–2022 window).

6 Challenges and open research avenues

There are many research challenges for emotion recognition from various modalities. These challenges can be classified as follows. First, regarding datasets, there are many datasets available for emotion recognition. Most of them are either uni-modal (using only one measurement, such as HR) or collected in the laboratory. Thus, what is missing is the creation of real-world multi-modal datasets collected in the wild, to be used as benchmarks for research experiments and for comparing algorithms. In addition, collecting real data is always a challenge, as sometimes the data collection devices are invasive, such as those for brain signals. Many sensors are now built into smartwatches and wristbands, so it can be beneficial to find correlates between these signals.
Second, related to classification models, there are also many opportunities to improve the performance of these models in terms of increasing accuracy and decreasing error rates. This can be done by using hybrid models, such as a CNN combined with other classification algorithms, and by using ensemble methods to avoid the drawbacks of some algorithms, such as over-fitting or under-fitting problems.
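For instance, the ensemble idea can be sketched as follows (synthetic data; the models and soft-voting scheme are illustrative choices, not a recommendation from a specific study):

```python
# Combining several classifiers by soft voting so that the weaknesses of
# any single model (e.g., over- or under-fitting) are averaged out.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))             # stand-in for extracted features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # stand-in for binary emotion labels

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(probability=True)),    # probability=True enables soft voting
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```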
Third, concerning model generalization, most of the work presented here uses personalized models (a model for each user). But there is a need for generalized models that can be used for any user (generic, or subject-independent, models). These models can be beneficial for reducing the time and effort required to create the models, and they can be turned into personalized models by fine-tuning and transfer learning.
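A hedged sketch of this personalization route — a generic network whose shared layers are frozen while only a fresh output head is retrained on a small amount of a new user's data; all shapes, sizes, and names are illustrative assumptions:

```python
# Fine-tuning a generic (subject-independent) model for one user:
# freeze the shared layers, retrain only a new classification head.
import torch
import torch.nn as nn

generic_model = nn.Sequential(           # stand-in for a pretrained generic model
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 4),                    # 4 emotion classes
)

for p in generic_model.parameters():     # freeze the shared feature layers
    p.requires_grad = False
generic_model[-1] = nn.Linear(64, 4)     # fresh head, trainable by default

optimizer = torch.optim.Adam(generic_model[-1].parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 32)                  # a small batch of the new user's features
y = torch.randint(0, 4, (16,))
for _ in range(10):                      # brief fine-tuning loop
    optimizer.zero_grad()
    loss = loss_fn(generic_model(x), y)
    loss.backward()
    optimizer.step()
```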
Finally, regarding model transparency and trust, there is a tendency in ML and AI research toward open, transparent models, which is essential for gaining user trust in these models. XAI (explainable artificial intelligence) is a subfield of AI that develops methods to explain such models, and it is a promising research area for emotion recognition systems.
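As one simple, model-agnostic example of such explanation methods (one option among many; the features and model below are synthetic placeholders), permutation importance scores how strongly a trained emotion classifier relies on each input feature:

```python
# Permutation importance: shuffle one feature at a time and measure how
# much the classifier's test score drops.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 6))              # e.g., HR, EDA, ST, ... features
y = (X[:, 0] - X[:, 2] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

result = permutation_importance(clf, X_te, y_te, n_repeats=20, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: importance = {result.importances_mean[i]:.3f}")
```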
7 Conclusions and future work

In this paper, we reviewed about 140 research papers in the field of emotion recognition. Affective Computing is the field of studying emotions. AHER is a powerful and effective approach for assessing human emotional states and forecasting human behavior to deliver the most appropriate marketing or educational strategies accordingly. It is also beneficial in various human–machine interaction systems.
Table 10 (continued)

Hassouneh et al. [61], Umer et al. [158], Chowdary et al. [29]. Classifiers: CNN, LSTM, and logistic regression (LR) [61]; CNN [158]; CNN and SVM [29]. Method: facial expression emotion recognition. Emotions: happiness, sadness, anger, fear, disgust, and surprise [61]; happiness, sadness, fear, disgust, surprise, anger, and neutral [158, 29]. Accuracy: Hassouneh et al. [61] achieved a maximum recognition rate of 99.81% using CNN for emotion detection from facial landmarks, while the maximum recognition rate achieved using the LSTM classifier was 87.25% for emotion detection from EEG signals. Umer et al. [158] achieved average CNN accuracies of 77.8% on the KDEF dataset, 87.2% on the GENKI dataset, and 92.8% on the CK+ dataset. Chowdary et al. [29] achieved an average emotion recognition accuracy of 96% on the CK+ database using SVM and CNN classifiers.

Acheampong et al. [2], Nandwani and Verma [120], Bharti et al. [16]. Classifiers: SVM, KNN, MLP, NB, and DT [2]; NB, SVM, RF, and CNN [120]; DT, SVM, NB, and RF as ML classifiers, and gated recurrent unit (GRU), bidirectional gated recurrent unit (Bi-GRU), and convolutional neural network (CNN) as DL classifiers [16]. Method: text emotion recognition. Emotions: joy, happiness, sadness, fear, anger, surprise, disgust, neutral, fun, worry, love, hate, enthusiasm, boredom, relief, empty, and scared [2]; furious, cheerful, or depressed, and positive, negative, and neutral [120]; joy, anger, guilt, sadness, disgust, fear, and shame [16]. Accuracy: in Acheampong et al. [2], KNN achieved an average accuracy of 83%, SVM 77%, MLP 77%, NB 74%, and DT 74%. Nandwani and Verma [120] showed that Naive Bayes achieved an F1 score above 90% in binary classification and above 60% in three-class sentiment classification, that random forest (RF), with an accuracy of 95.6%, performed better than the NB classifier, that the SVM classifier achieved an average accuracy of 85.47%, and that the CNN algorithm achieved 80% accuracy. In Bharti et al. [16], the ML classifiers SVM, RF, NB, and DT achieved average accuracies of 78.97%, 76.25%, 68.94%, and 69.42%, the DL classifiers GRU, Bi-GRU, and CNN achieved 78.02%, 79.46%, and 79.32%, and a hybrid model combining CNN, Bi-GRU, and SVM achieved an average accuracy of 80.11%.
Table 10 (continued)
Mittal et al. [114], Ilyas et al. [70], Raman et al. [130]. Classifiers: CNN and LSTM [114]; CNN, SVM, RNN, and RNN-LSTM [70]; random forest (RF), logistic regression (LR), gradient boosting classifier (GBC), and ridge classifier (RC) [130]. Method: facial expression, body gestures, and postures. Emotions: angry, happy, neutral, sad, disgust, fear, and surprise [114]; happy, sadness, anger, and fear [70]; happy, angry, disagree, disgust, fear, hello, namaste, okay, sad, shock, surprise, and victorious [130]. Accuracy: the authors in Mittal et al. [114] used two datasets: IEMOCAP, which contains four emotion labels (angry, happy, neutral, sad) and on which they achieved an average accuracy of 78.2% using LSTM and CNN, and the CMU-MOSEI dataset, which contains six emotion labels (angry, disgust, fear, happy, sad, surprise) and on which they achieved a mean classification accuracy of 85.0%. Ilyas et al. [70] obtained, using CNN across all emotions, 77.7% for facial expression features, 76.8% for upper-body movement (hand and head movement) features, 85.7% for bimodal average fusion, 86.6% for bimodal product fusion, and 87.2% for bimodal bilinear pooling. In Raman et al. [130], random forest, logistic regression, and the ridge classifier each achieved an average accuracy of 1.00 (100%), while the gradient boosting classifier achieved an average accuracy of 96%.
This review presented various ML algorithms, in addition to presenting physical signals such as facial expressions, text messages, body gestures and postures, physiological signals, speech, and also environmental sensors. These are the most commonly used modalities in AHER, which depend on measuring various parameters and applying ML methods for emotion recognition.
Some studies used uni-modal methods (only one modality) and others used multi-modal methods (combining more than one modality). Results showed that using multi-modal methods can have a positive impact on the performance of emotion recognition models.
Selecting among these methods depends on the nature of the problem and the available data. In this work, we highlighted various studies that contributed to the debate over what constitutes emotion and whether we can experimentally quantify emotions, in addition to presenting various ML algorithms.
Given the subjective nature of emotions, developing an efficient method for recognizing different emotional states remains a significant challenge. The majority of cutting-edge existing research depends on subject-dependent techniques or personalized models. To create a generic AHER system, multi-modal datasets and suitable algorithms for emotion identification are needed.
Challenges related to AHER can be summarized as challenges concerning data, methods, and models. Collecting real data is always a challenge, as sometimes the data collection devices are invasive, such as those for brain signals. For developing accurate models, huge amounts of data collected in the wild are needed. Concerning the methods, recent research suggests that hybrid deep learning methods and ensemble learning methods are promising in terms of model accuracy. But in terms of explainability, the presented models are black boxes, so more work is needed to make these models explainable (understandable by humans).
Future research should address the following points:
• Concentrating more on deploying multi-modal data and approaches to emotion recognition, as combining more than one modality with the use of ML and data analysis will lead to advances in practical
Table 11 The most common ML algorithms used with physiological and speech stimuli, and with physiological and environmental factors, for emotion recognition
Authors Classifiers Methods Emotions Accuracy in %.
Domı́nguez- RF, SVM Domı́nguez-Jiménez et al. [43]. Physiological amusement, sadness, and neutral Domı́nguez- The authors in Domı́nguez-Jiménez et al. [43]
Jiménez et al. Conventional neural networks (CNNs) Ozdemir signals Jiménez et al. [43]. Valence, Arousal, were able to recognize the three target emotions
[43], Ozdemir et al. [122]. Sparse autoencoders Lee et al. [89] Dominance, and Liking emotional states with an accuracy of up to 100% when evaluated
et al. [122], Lee Ozdemir et al. [122]. Arousal, valence Lee et al. on the test dataset. The authors in Ozdemir et al.
et al. [89] [89] [122] achieved test accuracy of 90.62% for
negative and positive Valence, 86.13% for high
and low Arousal, 88.48% for high and low
Dominance, and finally 86.23% for like-unlike.
References
1. Abdou MA (2022) Literature review: efficient deep neural networks techniques for medical image analysis. Neural Comput Appl 34(8):5791–5812
2. Acheampong FA, Wenyu C, Nunoo-Mensah H (2020) Text-based emotion detection: advances, challenges, and opportunities. Eng Rep 2(7):e12189
interaction with detection of speaker emotions using convolution neural networks. Comput Intell Neurosci 2022:746309
6. Alswaidan N, Bachir MME (2020) A survey of state-of-the-art approaches for emotion recognition in text. Knowl Inf Syst 62:2937–2987
7. AlZoubi O, D'Mello SK, Calvo RA (2012) Detecting naturalistic expressions of nonbasic affect using physiological signals. IEEE Trans Affect Comput 3(3):298–310
8. Andreu-Perez AR, Kiani M, Andreu-Perez J, Reddy P, Andreu-Abela J, Pinto M, Izzetoglu K (2021) Single-trial recognition of video gamer's expertise from brain haemodynamic and facial emotion responses. Brain Sci 11(1):106
9. Arsalan A, Anwar SM, Majid M (2022) Mental stress detection using data from wearable and non-wearable sensors: a review. arXiv preprint arXiv:2202.03033
10. Atanassov AV, Pilev DI, Tomova FN, Kuzmanova VD (2021) Hybrid system for emotion recognition based on facial expressions and body gesture recognition. In: 2021 international conference automatics and informatics (ICAI), pp 135–140. IEEE
11. Aychet J, Monchy N, Blois-Heulin C, Lemasson A (2022) Context-dependent gestural laterality: a multifactorial analysis in captive red-capped mangabeys. Animals 12(2):186
12. Baldi P (2012) Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML workshop on unsupervised and transfer learning, pp 37–49. JMLR Workshop and Conference Proceedings
13. Bargal SA, Barsoum E, Ferrer CC, Zhang C (2016) Emotion recognition in the wild from videos using images. In: Proceedings of the 18th ACM international conference on multimodal interaction, pp 433–436
14. Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M (2007) Generative or discriminative? Getting the best of both worlds. Bayesian Stat 8(3):3–24
15. Berrar D (2018) Bayes' theorem and naive Bayes classifier. In: Encyclopedia of bioinformatics and computational biology: ABC of bioinformatics, 403
16. Bharti SK, Varadhaganapathy S, Kumar GR, Kumar SP, Mohamed B, Karanja HS, Amena M (2022) Text-based emotion recognition using deep learning approach. Comput Intell Neurosci 2022:2645381
17. Biedebach L, Rusanen M, Leppänen T, Islind AS, Thordarson B, Arnardottir E, Óskarsdóttir M, Korkalainen H, Nikkonen S, Kainulainen S et al (2023) Towards a deeper understanding of sleep stages through their representation in the latent space of variational autoencoders
18. Bizzego A, Gabrieli G, Esposito G (2021) Deep neural networks and transfer learning on a multivariate physiological signal dataset. Bioengineering 8(3):35
19. Bond-Taylor S, Leach A, Long Y, Willcocks CG (2021) Deep generative modelling: a comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive models. IEEE Trans Pattern Anal Mach Intell 44:7327
20. Borod JC, Madigan NK (2000) Neuropsychology of emotion and emotional disorders: an overview and research directions. In: The neuropsychology of emotion, pp 3–28
21. Briggs D (2003) Environmental pollution and the global burden of disease. Br Med Bull 68(1):1–24
22. Buehlmann P (2006) Boosting for high-dimensional linear models. Ann Stat 34(2):559–583
23. Bühlmann PL (2003) Bagging, subagging and bragging for improving some prediction algorithms. In: Research report/Seminar für Statistik, Eidgenössische Technische Hochschule (ETH), vol 113. Seminar für Statistik, ETH, Zürich
24. Calvo RA, Kim SM (2013) Emotions in text: dimensional and categorical models. Comput Intell 29(3):527–543
25. Canales L, Martínez-Barco P (2014) Emotion detection from text: a survey. In: Proceedings of the workshop on natural language processing in the 5th information systems research working days (JISIC), pp 37–43
26. Charisis V, Hadjidimitriou S, Hadjileontiadis L, Uğurca D, Yilmaz E (2015) EmoActivity - an EEG-based gamified emotion HCI for augmented artistic expression: the i-Treasures paradigm. In: International conference on universal access in human–computer interaction, pp 29–40. Springer
27. Chen J, Ro T, Zhu Z (2022) Emotion recognition with audio, video, EEG, and EMG: a dataset and baseline approaches. IEEE Access 10:13229–13242
28. Chen J, Yang L, Tan L, Ruyi X (2022) Orthogonal channel attention-based multi-task learning for multi-view facial expression recognition. Pattern Recognit 129:108753
29. Chowdary MK, Nguyen TN, Hemanth DJ (2021) Deep learning-based facial emotion recognition for human–computer interaction applications. Neural Comput Appl 35:1–18
30. Chung W-Y, Bhardwaj S, Punvar A, Lee D-S, Myllylae R (2007) A fusion health monitoring using ECG and accelerometer sensors for elderly persons at home. In: 2007 29th annual international conference of the IEEE engineering in medicine and biology society, pp 3818–3821. IEEE
31. Granero AC, Fuentes-Hurtado F, Ornedo VN, Provinciale JG, Ausín JM, Raya MA (2016) A comparison of physiological signal analysis techniques and classifiers for automatic emotional evaluation of audiovisual contents. Front Comput Neurosci 10:74
32. Creswell A, White T, Dumoulin V, Arulkumaran K, Sengupta B, Bharath AA (2018) Generative adversarial networks: an overview. IEEE Signal Process Mag 35(1):53–65
33. Dandıl E, Özdemir R (2019) Real-time facial emotion classification using deep learning. Data Sci Appl 2(1):13–17
34. Das P, Khasnobish A, Tibarewala DN (2016) Emotion recognition employing ECG and GSR signals as markers of ANS. In: 2016 conference on advances in signal processing (CASP), pp 37–42. IEEE
35. Datcu D, Rothkrantz L (2009) Multimodal recognition of emotions in car environments. DCI&I 2009
36. de Santos A, Sierra CS, Guerra ÁJ, Casanova DP, Bailador G (2011) A stress-detection system based on physiological signals and fuzzy logic. IEEE Trans Ind Electron 58(10):4857–4865
37. Dhall A, Goecke R, Lucey S, Gedeon T (2012) Collecting large, richly annotated facial-expression databases from movies. IEEE Multimed 19(03):34–41
38. Dhariwal P, Nichol A (2021) Diffusion models beat GANs on image synthesis. Adv Neural Inf Process Syst 34:8780–8794
39. Dimitriadis SI, Liparas D, Initiative ADN et al (2018) How random is the random forest? Random forest algorithm on the service of structural imaging biomarkers for Alzheimer's disease: from Alzheimer's disease neuroimaging initiative (ADNI) database. Neural Regener Res 13(6):962
40. Ding W, Lin H, Li B, Zhao D (2023) CausalAF: causal autoregressive flow for safety-critical driving scenario generation. In: Conference on robot learning, pp 812–823. PMLR
41. Doersch C (2016) Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908
42. Doma V, Pirouz M (2020) A comparative analysis of machine learning methods for emotion recognition using EEG and peripheral physiological signals. J Big Data 7(1):1–21
43. Domínguez-Jiménez JA, Campo-Landines KC, Martínez-Santos JC, Delahoz EJ, Contreras-Ortiz SH (2020) A machine learning model for emotion recognition from physiological signals. Biomed Signal Process Control 55:101646
44. Dzedzickis A, Kaklauskas A, Bucinskas V (2020) Human emotion recognition: review of sensors and methods. Sensors 20(3):592
45. Egger M, Ley M, Hanke S (2019) Emotion recognition from physiological signal analysis: a review. Electron Notes Theor Comput Sci 343:35–55
46. Ekman P (1992) An argument for basic emotions. Cogn Emot 6(3–4):169–200
47. Fei H, Fan Z, Wang C, Zhang N, Wang T, Chen R, Bai T (2022) Cotton classification method at the county scale based on multi-features and random forest feature selection algorithm and classifier. Remote Sens 14(4):829
48. Feng J, He X, Teng Q, Ren C, Chen H, Li Y (2019) Reconstruction of porous media from extremely limited information using conditional generative adversarial networks. Phys Rev E 100(3):033308
49. Field T, Diego M, Hernandez-Reif M (2010) Preterm infant massage therapy research: a review. Infant Behav Dev 33(2):115–124
50. Garcia-Garcia JM, Penichet VMR, Lozano MD (2017) Emotion detection: a technology review. In: Proceedings of the XVIII international conference on human computer interaction, pp 1–8
51. Garg A, Chaturvedi V, Kaur AB, Varshney V, Parashar A (2022) Machine learning model for mapping of music mood and human emotion based on physiological signals. Multimed Tools Appl 81:5137
52. Gay V, Leijdekkers P, Wong F (2013) Using sensors and facial expression recognition to personalize emotion learning for autistic children. Stud Health Technol Inform 189:71–76
53. Ghojogh B, Ghodsi A, Karray F, Crowley M (2021) Factor analysis, probabilistic principal component analysis, variational inference, and variational autoencoder: tutorial and survey. arXiv preprint arXiv:2101.00734
54. Goodfellow I, Pouget-Abadie J, Mirza M, Bing X, Warde-Farley D, Ozair S, Courville A, Bengio Y (2020) Generative adversarial networks. Commun ACM 63(11):139–144
55. Gouizi K, Reguig FB, Maaoui C (2011) Analysis physiological signals for emotion recognition. In: International workshop on systems, signal processing and their applications, WOSSPA, pp 147–150. IEEE
56. Grande E (2022) From physiological signals to emotions: an integrative literature review. BS thesis
57. Guendil Z, Lachiri Z, Maaoui C, Pruski A (2016) Multiresolution framework for emotion sensing in physiological signals. In: 2016 2nd international conference on advanced technologies for signal and image processing (ATSIP), pp 793–797. IEEE
58. Haag A, Goronzy S, Schaich P, Williams J (2004) Emotion recognition using bio-sensors: first steps towards an automatic system. In: Tutorial and research workshop on affective dialogue systems, pp 36–48. Springer
59. Halbouni A, Gunawan TS, Habaebi MH, Halbouni M, Kartiwi M, Ahmad R (2022) Machine learning and deep learning approaches for cybersecurity: a review. IEEE Access
60. Hao M, Cao W-H, Liu Z-T, Min W, Xiao P (2020) Visual-audio emotion recognition based on multi-task and ensemble learning with multiple features. Neurocomputing 391:42–51
61. Hassouneh A, Mutawa AM, Murugappan M (2020) Development of a real-time emotion recognition system using facial expressions and EEG based on machine learning and deep neural network methods. Inform Med Unlocked 20:100372
62. Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inf Process Syst 33:6840–6851
63. Ho J, Saharia C, Chan W, Fleet DJ, Norouzi M, Salimans T (2022) Cascaded diffusion models for high fidelity image generation. J Mach Learn Res 23(47):1–33
64. Ho Y-H, Chang C-P, Chen P-Y, Gnutti A, Peng W-H (2022) CANF-VC: conditional augmented normalizing flows for video compression. In: Computer vision – ECCV 2022: 17th European conference, Tel Aviv, Israel, October 23–27, 2022, proceedings, part XVI, pp 207–223. Springer
65. Hoseinzadeh S, Sohani A, Ashrafi TG (2022) An artificial intelligence-based prediction way to describe flowing a Newtonian liquid/gas on a permeable flat surface. J Therm Anal Calorim 147(6):4403–4409
66. Houssein EH, Hammad A, Ali AA (2022) Human emotion recognition from EEG-based brain–computer interface using machine learning: a comprehensive review. Neural Comput Appl 34(15):12527–12557
67. Hu J, Li Y (2022) Electrocardiograph based emotion recognition via WGAN-GP data enhancement and improved CNN. In: Intelligent robotics and applications: 15th international conference, ICIRA 2022, Harbin, China, August 1–3, 2022, proceedings, part I, pp 155–164. Springer
68. Hu Q, Li X, Fang H, Wan Q (2022) The tactile perception evaluation of wood surface with different roughness and shapes: a study using galvanic skin response. Wood Res 67(2):311–325
69. Hua TK (2022) A short review on machine learning. Authorea Preprints
70. Ilyas CMA, Nunes R, Nasrollahi K, Rehm M, Moeslund TB (2021) Deep emotion recognition through upper body movements and facial expression. In: VISIGRAPP (5: VISAPP), pp 669–679
71. Irrgang M, Egermann H (2016) From motion to emotion: accelerometer data predict subjective experience of music. PLoS ONE 11(7):e0154360
72. Islam MMM, Kim J, Khan SA, Kim J-M (2017) Reliable bearing fault diagnosis using Bayesian inference-based multi-class support vector machines. J Acoust Soc Am 141(2):1–8
73. Jayanthi K, Mohan S (2022) An integrated framework for emotion recognition using speech and static images with deep classifier fusion approach. Int J Inf Technol 14(7):3401–3411
74. Ji G-W, Jiao C-Y, Zheng-Gang X, Li X-C, Wang K, Wang X-H (2022) Development and validation of a gradient boosting machine to predict prognosis after liver resection for intrahepatic cholangiocarcinoma. BMC Cancer 22(1):1–10
75. Kanjo E, Younis EMG, Sherkat N (2018) Towards unravelling the relationship between on-body, environmental and emotion data using sensor information fusion approach. Inf Fusion 40:18–31
76. Kanjo E, Younis EMG, Ang CS (2019) Deep learning analysis of mobile physiological, environmental and location sensor data for emotion detection. Inf Fusion 49:46–56
77. Karaca BK, Akşahin MF, Öcal R (2021) Detection of multiple sclerosis from photic stimulation EEG signals. Biomed Signal Process Control 67:102571
78. Karpathy A, Johnson J, Fei-Fei L (2015) Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078
79. Khan G, Samyan S, Khan MUG, Shahid M, Wahla SQ (2020) A survey on analysis of human faces and facial expressions datasets. Int J Mach Learn Cybern 11(3):553–571
80. Kim J (2007) Bimodal emotion recognition using speech and physiological changes. In: Robust speech recognition and understanding, vol 265, pp 280
81. Kim J, André E (2008) Emotion recognition based on physiological changes in music listening. IEEE Trans Pattern Anal Mach Intell 30(12):2067–2083
82. Kööts L, Realo A, Allik J (2011) The influence of the weather on affective experience. J Individ Differ 32:74–84
83. Kose MR, Ahirwal MK, Kumar A (2021) A new approach for emotions recognition through EOG and EMG signals. Signal Image Video Process 15(8):1863–1871
84. Kreibig SD (2010) Autonomic nervous system activity in emotion: a review. Biol Psychol 84(3):394–421
85. Kulic D, Croft EA (2007) Affective state estimation for human–robot interaction. IEEE Trans Robot 23(5):991–1000
86. Lakshmanna K, Kaluri R, Gundluru N, Alzamil ZS, Rajput DS, Khan AA, Haq MA, Alhussen A (2022) A review on deep learning techniques for IoT data. Electronics 11(10):1604
87. Larestani A, Mousavi SP, Hadavimoghaddam F, Hemmati-Sarapardeh A (2022) Predicting formation damage of oil fields due to mineral scaling during water-flooding operations: gradient boosting decision tree and cascade-forward back-propagation network. J Pet Sci Eng 208:109315
88. Lee SK, Bae M, Lee W, Kim H (2017) CEPP: perceiving the emotional state of the user based on body posture. Appl Sci 7(10):978
89. Lee YK, Pae DS, Hong DK, Lim MT, Kang TK (2022) Emotion recognition with short-period physiological signals using bimodal sparse autoencoders. Intell Autom Soft Comput 32(2):657–673
90. Li P, Pei Y, Li J (2023) A comprehensive survey on design and application of autoencoder in deep learning. Appl Soft Comput 138:110176
91. Li Y (2012) Hand gesture recognition using Kinect. In: 2012 IEEE international conference on computer science and automation engineering, pp 196–199. IEEE
92. Licen S, Astel A, Tsakovski S (2023) Self-organizing map algorithm for assessing spatial and temporal patterns of pollutants in environmental compartments: a review. Sci Total Environ 878:163084
93. Lin W, Li C (2023) Review of studies on emotion recognition and judgment based on physiological signals. Appl Sci 13(4):2573
94. Lisetti CL, Nasoz F (2004) Using noninvasive wearable computers to recognize human emotions from physiological signals. EURASIP J Adv Signal Process 2004(11):1–16
95. Liu G, Bao H, Han B (2018) A stacked autoencoder-based deep neural network for achieving gearbox fault diagnosis. Math Probl Eng 2018:5105709
96. Liu H, Lang B (2019) Machine learning and deep learning methods for intrusion detection systems: a survey. Appl Sci 9(20):4396
97. Llewelyn CJ (2023) Chakras and the vagus nerve: tap into the healing combination of subtle energy & your nervous system. Llewellyn Worldwide
98. Lopez R, Boyeau P, Yosef N, Jordan M, Regier J (2020) Decision-making with auto-encoding variational Bayes. Adv Neural Inf Process Syst 33:5081–5092
99. Lövheim H (2012) A new three-dimensional model for emotions and monoamine neurotransmitters. Med Hypotheses 78(2):341–348
100. Luo C (2022) Understanding diffusion models: a unified perspective. arXiv preprint arXiv:2208.11970
101. Maaoui C, Pruski A (2010) Emotion recognition through physiological signals for human–machine communication. In: Cutting edge robotics 2010, pp 317–332
102. Madani A, Moradi M, Karargyris A, Syeda-Mahmood T (2018) Semi-supervised learning with generative adversarial networks for chest X-ray classification with ability of data domain adaptation. In: 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018), pp 1038–1042. IEEE
103. Madeh Piryonesi S, El-Diraby TE (2021) Using machine learning to examine impact of type of performance indicator on flexible pavement deterioration modeling. J Infrastruct Syst 27(2):04021005
104. Mahmoudi MR, Heydari MH, Qasem SN, Mosavi A, Band SS (2021) Principal component analysis to study the relations between the spread rates of COVID-19 in high risks countries. Alex Eng J 60(1):457–464
105. Maithri M, Raghavendra U, Gudigar A, Samanth J, Barua PD, Murugappan M, Chakole Y, Acharya UR (2022) Automated emotion recognition: current trends and future perspectives. Comput Methods Programs Biomed 215:106646
106. Maji S, Arora S (2019) Decision tree algorithms for prediction of heart disease. In: Information and communication technology for competitive strategies, pp 447–454. Springer
107. Majtner T, Bajić B, Herp J (2021) Texture-based image transformations for improved deep learning classification. In: Iberoamerican congress on pattern recognition, pp 207–216. Springer
108. Malik R, Singh Y, Sheikh ZA, Anand P, Singh PK, Workneh TC (2022) An improved deep belief network IDS on IoT-based network for traffic systems. J Adv Transp 2022:17
109. Malus J, Skypala J, Silvernail JF, Uchytil J, Hamill J, Barot T, Jandacka D (2021) Marker placement reliability and objectivity for biomechanical cohort study: healthy aging in industrial environment (HAIE - program 4). Sensors 21(5):1830
110. McCallum A (2019) Graphical models, lecture 2: Bayesian network representation
111. Metri P, Ghorpade J, Butalia A (2011) Facial emotion recognition using context based multimodal approach. Int J Interact Multimed Artif Intell 1:12–15
112. Middya AI, Nag B, Roy S (2022) Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities. Knowl-Based Syst 244:108580
113. Mim SS, Logofatu D (2022) A cluster-based analysis for targeting potential customers in a real-world marketing system. In: 2022 IEEE 18th international conference on intelligent computer communication and processing (ICCP), pp 159–166. IEEE
114. Mittal T, Bhattacharya U, Chandra R, Bera A, Manocha D (2020) M3ER: multiplicative multimodal emotion recognition using facial, textual, and speech cues. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 1359–1367
115. Mohd TK, Nguyen N, Javaid AY (2022) Multi-modal data fusion in enhancing human–machine interaction for robotic applications: a survey. arXiv preprint arXiv:2202.07732
116. Mohsen S, Alharbi AG (2021) EEG-based human emotion prediction using an LSTM model. In: 2021 IEEE international midwest symposium on circuits and systems (MWSCAS), pp 458–461. IEEE
117. Montero Quispe KG, Utyiama DMS, dos Santos EM, Oliveira HABF, Souto EJP (2022) Applying self-supervised representation learning for emotion recognition using physiological signals. Sensors 22(23):9102
118. Montoya MF, Muñoz J, Henao OA (2021) Fatigue-aware videogame using biocybernetic adaptation: a pilot study for upper-limb rehabilitation with sEMG. Virtual Real 27:1–14
119. Nandani S, Nanavati R, Khare M (2022) Emotion detection using facial expressions. In: Futuristic trends in networks and computing technologies: select proceedings of fourth international conference on FTNCT 2021, pp 627–640. Springer
120. Nandwani P, Verma R (2021) A review on sentiment analysis and emotion detection from text. Soc Netw Anal Min 11(1):81
121. Ng A, Jordan M (2001) On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In: Advances in neural information processing systems, 14
122. Ozdemir MA, Degirmenci M, Izci E, Akan A (2021) EEG-based emotion recognition with deep convolutional neural networks. Biomed Eng/Biomed Tech 66(1):43–57
123. Park N-K, Farr CA (2007) The effects of lighting on consumers' emotions and behavioral intentions in a retail environment: a cross-cultural comparison. J Inter Des 33(1):17–32
124. Peng S, Cao L, Zhou Y, Ouyang Z, Yang A, Li X, Jia W, Yu S (2022) A survey on deep learning for textual emotion analysis in social networks. Dig Commun Netw 8(5):745–762
125. Pham T, Lau ZJ, Chen SHA, Makowski D (2021) Heart rate variability in psychology: a review of HRV indices and an analysis tutorial. Sensors 21(12):3998
126. Madeh Piryonesi S, El-Diraby TE (2020) Data analytics in asset management: cost-effective prediction of the pavement condition index. J Infrastruct Syst 26(1):04019036
127. Madeh Piryonesi S, El-Diraby TE (2020) Role of data analytics in infrastructure asset management: overcoming data size and quality problems. J Transp Eng Part B Pavements 146(2):04020022
128. Priyasad D, Fernando T, Denman S, Sridharan S, Fookes C (2022) Affect recognition from scalp-EEG using channel-wise encoder networks coupled with geometric deep learning and multi-channel feature fusion. Knowl-Based Syst 250:109038
129. Raheel A, Majid M, Alnowami M, Anwar SM (2020) Physiological sensors based emotion recognition while experiencing tactile enhanced multimedia. Sensors 20(14):4037
130. Raman S, Patel S, Yadav S, Singh V (2022) Emotion and gesture detection. Int J Res Appl Sci Eng Technol 10:3731–3734
131. Romaniszyn-Kania P, Pollak A, Danch-Wierzchowska M, Kania D, Myśliwiec AP, Pitka E, Mitas AW (2020) Hybrid system of emotion evaluation in physiotherapeutic procedures. Sensors 20(21):6343
132. Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B (2022) High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10684–10695
133. Russell JA (1980) A circumplex model of affect. J Pers Soc Psychol 39(6):1161
134. Saganowski S (2022) Bringing emotion recognition out of the lab into real life: recent advances in sensors and machine learning. Electronics 11(3):496
135. Said Y, Barr M (2021) Human emotion recognition based on facial expressions via deep learning on high-resolution images. Multimed Tools Appl 80(16):25241–25253
136. Sailunaz K, Dhaliwal M, Rokne J, Alhajj R (2018) Emotion detection from text and speech: a survey. Soc Netw Anal Min 8(1):1–26
137. Salama ES, El-Khoribi RA, Shoman ME, Wahby MA, Shalaby S (2021) A 3D-convolutional neural network framework with ensemble learning techniques for multi-modal emotion recognition. Egypt Inform J 22(2):167–176
138. Salmi A, Li J, Holtta-Otto K (2023) Automatic facial expression analysis as a measure of user-designer empathy. J Mech Des 145(3):031403
139. Saneiro M, Santos OC, Salmeron-Majadas S, Boticario JG (2014) Towards emotion detection in educational scenarios from facial expressions and body movements through multimodal approaches. Sci World J 2014:15
140. Sapiński T, Kamińska D, Pelikant A, Anbarjafari G (2019) Emotion recognition from skeletal movements. Entropy 21(7):646
141. Saxena A, Khanna A, Gupta D (2020) Emotion recognition and detection methods: a comprehensive survey. J Artif Intell Syst 2(1):53–79
142. Sepúlveda A, Castillo F, Palma C, Rodriguez-Fernandez M (2021) Emotion recognition from ECG signals using wavelet scattering and machine learning. Appl Sci 11(11):4945
143. Shastry KA, Vijayakumar V, Manoj Kumar MV, Manjunatha BA, Chandrashekhar BN (2022) Deep learning techniques for the effective prediction of Alzheimer's disease: a comprehensive review. In: Healthcare, vol 10, p 1842. MDPI
144. Shaver P, Schwartz J, Kirson D, O'Connor C (1987) Emotion knowledge: further exploration of a prototype approach. J Personal Soc Psychol 52(6):1061
145. Shoumy NJ (2022) Multimodal emotion recognition using data augmentation and fusion. PhD thesis, Charles Sturt University, Australia
146. Singh V, Asari VK, Rajasekaran R (2022) A deep neural network for early detection and prediction of chronic kidney disease. Diagnostics 12(1):116
147. Singh YB, Goel S (2022) A systematic literature review of speech emotion recognition approaches. Neurocomputing 492:245–263
148. Sohl-Dickstein J, Weiss E, Maheswaranathan N, Ganguli S (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In: International conference on machine learning, pp 2256–2265. PMLR
149. Srivastava A (2021) Impact of k-nearest neighbour on classification accuracy in KNN algorithm using machine learning. In: Advances in smart communication and imaging systems, pp 363–373. Springer
150. Staudemeyer RC, Morris ER (2019) Understanding LSTM—a tutorial into long short-term memory recurrent neural networks. arXiv preprint arXiv:1909.09586
151. Stock-Homburg R (2022) Survey of emotions in human–robot interactions: perspectives from robotic psychology on 20 years of research. Int J Soc Robot 14(2):389–411
152. Stržinar Ž, Sanchis A, Ledezma A, Sipele O, Pregelj B, Škrjanc I (2023) Stress detection using frequency spectrum analysis of wrist-measured electrodermal activity. Sensors 23(2):963
153. Subramanian R, Wache J, Abadi MK, Vieriu RL, Winkler S, Sebe N (2016) ASCERTAIN: emotion and personality recognition using commercial sensors. IEEE Trans Affect Comput 9(2):147–160
154. Takahashi K (2004) Remarks on SVM-based emotion recognition from multi-modal bio-potential signals. In: RO-MAN 2004. 13th IEEE international workshop on robot and human interactive communication (IEEE catalog no. 04TH8759), pp 95–100. IEEE
155. Tarnowski P, Kołodziej M, Majkowski A, Rak RJ (2020) Eye-tracking analysis for emotion recognition. Comput Intell Neurosci 2020:2909267
156. Thakkar A, Lohiya R (2023) Fusion of statistical importance for feature selection in deep neural network-based intrusion detection system. Inf Fusion 90:353–363
157. Tomczak JM (2022) Deep generative modeling. Springer
158. Umer S, Rout RK, Pero C, Nappi M (2022) Facial expression recognition with trade-offs between data augmentation and deep learning features. J Ambient Intell Hum Comput 13(2):721–735
159. Vala JM, Jaliya UK (2023) Analytical review and study on emotion recognition strategies using multimodal signals. In: Advancements in smart computing and information security: first international conference, ASCIS 2022, Rajkot, India, November 24–26, 2022, revised selected papers, part I, pp 267–285. Springer
160. Vařeka L, Mautner P (2017) Stacked autoencoders for the P300 component detection. Front Neurosci 11:302
161. Varghese BA, Sandy L, Steven C, Amir T, Passant M, Daniel S, Melissa P, Bhushan D, Duddalwar VA, Larsen LH (2022) Characterizing breast masses using an integrative framework of machine learning and CEUS-based radiomics. J Ultrasound 25:1–10
162. Varshney D, Ekbal A, Tiwari M, Nagaraja GP (2023) EmoKbGAN: emotion controlled response generation using generative adversarial network for knowledge grounded conversation. PLoS ONE 18(2):e0280458
163. Viola P, Jones MJ (2004) Robust real-time face detection. Int J Comput Vis 57(2):137–154
164. Wagh KP, Vasanth K (2022) Performance evaluation of multichannel electroencephalogram signal (EEG) based time frequency analysis for human emotion recognition. Biomed Signal Process Control 78:103966
165. Walter Y, Altorfer A (2023) Electrodermal activity implicating a sympathetic nervous system response under the perception of sensing a divine presence - a psychophysiological analysis. Psych 5(1):102–112
166. Wan-Hui W, Yu-Hui Q, Guang-Yuan L (2009) Electrocardiography recording, feature extraction and classification for emotion recognition. In: 2009 WRI world congress on computer science and information engineering, vol 4, pp 168–172. IEEE
167. Wang Q, Wang M, Yang Y, Zhang X (2022) Multi-modal emotion recognition using EEG and speech signals. Comput Biol Med 149:105907
168. Wang X, Guo Y, Ban J, Qing X, Bai C, Liu S (2020) Driver emotion recognition of multiple-ECG feature fusion based on BP network and D-S evidence. IET Intell Transp Syst 14(8):815–824
169. Wang Y, Song W, Tao W, Liotta A, Yang D, Li X, Gao S, Sun Y, Ge W, Zhang W et al (2022) A systematic review on affective computing: emotion models, databases, and recent advances. Inf Fusion 83:19
170. Wu M-H, Chang T-C (2021) Evaluation of effect of music on human nervous system by heart rate variability analysis using ECG sensor. Sens Mater 33:739–753
171. Xu Y, Hübener I, Seipp A-K, Ohly S, David K (2017) From the lab to the real-world: an investigation on the influence of human movement on emotion recognition using physiological signals. In: 2017 IEEE international conference on pervasive computing and communications workshops (PerCom workshops), pp 345–350. IEEE
172. Yang S, Yang G (2011) Emotion recognition of EMG based on improved LM BP neural network and SVM. J Softw 6(8):1529–1536
173. Yin G, Sun S, Yu D, Li D, Zhang K (2022) A multimodal framework for large-scale emotion recognition by fusing music and electrodermal activity signals. ACM Trans Multimed Comput Commun Appl (TOMM) 18(3):1–23
174. Younis EMG, Zaki SM, Kanjo E, Houssein EH (2022) Evaluating ensemble learning methods for multi-modal emotion recognition using sensor data fusion. Sensors 22(15):5611
175. Zhang J, Yin Z, Chen P, Nichele S (2020) Emotion recognition using multi-modal data and machine learning techniques: a tutorial and review. Inf Fusion 59:103–126
176. Zhang J, Zhou Y, Liu Y (2020) EEG-based emotion recognition using an improved radial basis function neural network. J Ambient Intell Hum Comput 1–12
177. Zhang T, Lin W, Vogelmann AM, Zhang M, Xie S, Qin Y, Golaz J-C (2021) Improving convection trigger functions in deep convective parameterization schemes using machine learning. J Adv Model Earth Syst 13(5):1–19
178. Zhang X-D (2020) A matrix algebra approach to artificial intelligence. Springer
179. Zhao H, Xiao Y, Zhang Z (2020) Robust semisupervised generative adversarial networks for speech emotion recognition via distribution smoothness. IEEE Access 8:106889–106900
180. Zheng C, Wu G, Bao F, Cao Y, Li C, Zhu J (2023) Revisiting discriminative vs. generative classifiers: theory and implications. arXiv preprint arXiv:2302.02334
181. Zheng X, Nguyen H (2022) A novel artificial intelligent model for predicting water treatment efficiency of various biochar systems based on artificial neural network and queuing search algorithm. Chemosphere 287:132251
182. Zhu C, Idemudia CU, Feng W (2019) Improved logistic regression model for diabetes prediction by integrating PCA and k-means techniques. Inform Med Unlocked 17:100179
183. Zhu L, Zhu Z, Zhang C, Xu Y, Kong X (2023) Multimodal sentiment analysis based on fusion methods: a survey. Inf Fusion 95:306–325
184. Zounemat-Kermani M, Batelaan O, Fadaee M, Hinkelmann R (2021) Ensemble machine learning paradigms in hydrology: a review. J Hydrol 598:126266

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.