Speech Emotion Recognition Using Deep Learning Hybrid Models
Abstract— Speech Emotion Recognition (SER) has been essential to Human-Computer Interaction (HCI) and other complex speech processing systems over the past decade. Because of the emotive differences between speakers, SER is a complex and challenging task. The features retrieved from speech signals are crucial to an SER system's performance, and it is still challenging to develop efficient feature extraction and classification models. This study proposes hybrid deep learning models for accurately extracting crucial features and producing predictions with higher probabilities. The temporal features of the Mel spectrogram are trained using a combination of stacked Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks, and this model performs well. To enhance the speech samples, they are first preprocessed using data augmentation and dataset balancing techniques. The RAVDESS dataset is used in this study; it contains 1440 audio samples in a North American English accent. The strength of the CNN is used to obtain spatial features and sequence encodings, which gives the model an accuracy above 93.9% on this dataset when classifying emotions into one of eight categories. The model is generalized using Additive White Gaussian Noise (AWGN) and dropout techniques.

Keywords—Speech Emotion Recognition, CNN, SER, Stacked CNN

1. INTRODUCTION

Voice signals, the most natural and practical form of human communication, carry linguistic data such as semantics and language type as well as a wealth of non-linguistic data such as facial expression and speech emotion. SER [1] has become increasingly important in recent years as artificial intelligence has continued to advance, and researchers are increasingly interested in how computers can recognize people's emotions from speech. Speech contains various kinds of paralinguistic information, including emotion, which has made speech emotion recognition an appealing research problem in many domains.

Several factors make the recognition of emotion from speech signals very challenging. Despite efforts over the last decades, there are still few accurate and balanced speech emotion datasets [2]. SER systems face many difficulties. First, it takes a lot of time and work to create a high-quality speech emotion database. Second, such a database contains data from diverse speakers, each with a distinct gender, age, language, culture, rhythm, tonality, and so on. Finally, emotions in speech are commonly expressed through whole sentences rather than specific words. All of these elements are crucially significant for an SER system.

The readability of the text generated from audio and the accuracy and clarity of the extracted words are the main concerns of traditional speech information processing systems [3]. In addition to the words and information delivered, the speech signal also conveys the implicit emotional state of the speaker [4]. An SER system that reflects the speaker's emotions by separating the relevant acoustic components is the foundation of more efficient human-computer interaction. SER systems are useful and have crucial scientific significance in health, machine interaction, and other fields.

Traditionally, hand-crafted and engineered features, such as signal energy, voice pitch, entropy, zero-crossing rate, Mel-frequency cepstral coefficients (MFCC), and chroma-based features [5-8], were used to build machine learning (ML) models for speech emotion recognition (SER). How well these models perform, however, depends on the features included. Research is still being done on new features and algorithms to model the dynamics of feature sequences reflecting human emotions, even though it is not known which characteristics correlate most strongly with the different emotions. On the other hand, recent deep learning developments and the available processing capacity have enabled the scientific community to develop end-to-end SER systems efficiently.

These algorithms can learn directly from spectrograms or unprocessed waveforms [9, 10], eliminating the need to manually extract a large number of features, which is a significant advantage. CNN and LSTM models built on spectrograms and raw waveforms have been proposed in recent studies to increase SER performance [11, 12, 13, 14]. However, building such complex systems requires a large amount of labeled training data, and more labeled training data generally makes the models more accurate. Conversely, a lack of labeled training data may cause the models to overfit to particular data conditions and domains, impairing generalization to new test data.

This study uses the RAVDESS dataset to train two hybrid models, measure their accuracy, and compare the models' performance under slight changes to the training technique.

The following sections of this study summarize earlier research on SER and present the methodology, results, and conclusion.

2. RELATED WORK

A conventional SER system has three essential steps: preprocessing, feature extraction, and recognition or classification. An SER system's accuracy depends on the performance of correct feature extraction and correct classification.
Hidden Markov-based SER techniques generate output probability functions using a Gaussian distribution, which helps to model nonlinear and dynamic features, as proposed by Tin Lay Nwe [15]. However, because speech signals contain a variety of emotional states, the method requires numerous HMMs, which increases the training computation, makes it more difficult to recognize emotions, and lowers the overall accuracy. Bourlard et al. [16] developed a method for determining posterior probabilities using large-scale computation to address the inaccuracy of Hidden Markov models, estimating all prior probabilities from processed conditional posterior probabilities, which improves the accuracy of the priors. LSTM is more beneficial for large-scale acoustic modeling. The authors of [17] modeled the long-term dependencies of audio sequences in different network layers; their network achieves a higher accuracy rate and correlates strongly with the extraction of meaningful information. Others [18] proposed a layer-based CNN approach for speech recognition. They applied a reduction technique by adding an attention layer that finds significant weights for time frames to speed up learning. The attention layers help the network focus on specific segments, which increases the recognition rate and ensures that crucial information in the speech is not lost. The technique outperforms a basic convolutional neural network in recognition accuracy.

To address the low-accuracy issue, we propose a technique based on deep learning algorithms and apply several measures that help the model perform much better. Traditional NLP-style methods first convert speech to text and then train a model on the textual data; compared to these techniques, a CNN works much better, so we built our model with a neural network. The model is trained on Mel spectrograms of the audio. We preprocess the audio so that abnormal conditions, which can cause over- or underfitting, are removed before the input is given to the model, and clean spectrograms are generated as input. We also use advanced processing techniques such as adding noise so that the model generalizes well.

The model's architecture contains four stacked convolutional blocks attached sequentially. Features learned in one block are used as input to the next, so by the last block only the essential features are kept, producing good classification rates. Each conv block has the same configuration, with four layers: the convolution layer applies a filter to capture specific information (a detailed description is given in the methodology); the normalization layer normalizes the features; the activation layer applies the ReLU function to the normalized output; and finally, pooling and dropout layers reduce the dimensions and keep only the features that are passed forward. The last block is an LSTM, which memorizes current, past, and neighboring features so that the final decision can be made reliably. We use the RAVDESS dataset [19, 20, 21] and process 1440 audio clips for model generation. The main advantage of the created model is that classification can be done on both plain speech and sung audio. Its accuracy is also better than that of other models when the audio signals contain some noise, which shows that the model generalizes well.

3. METHODOLOGY

3.1 Dataset: Emotional Database

RAVDESS is an audio-visual database of emotional speech and song recorded by 24 professional actors. The accent used in this database is North American. An advantage of this dataset is that it includes both songs and plain speech. The songs, like the spoken utterances, express sadness, neutrality, fear, happiness, anger, surprise, and disgust. Each expression was produced at two emotional intensity levels (strong and normal), as well as a neutral one. Each recording was rated ten times for emotional validity, intensity, and sincerity, giving 7356 recordings (1440 audio-only files), with the ratings supplied by 247 untrained volunteers from North America.

The dataset is divided into train, validation, and test sets in an 80/10/10 split. The Standard Scaler is used to scale the dataset.

Figure 1: Dataset visualization.
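As a concrete illustration of this step, the split and scaling could be implemented as sketched below. This is a minimal sketch, assuming the features have already been extracted into an array; the array shapes and variable names are illustrative placeholders, not taken from the paper's code.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Placeholder arrays standing in for the flattened spectrogram features and
    # the eight emotion labels of the 1440 RAVDESS clips (illustrative shapes).
    X = np.random.randn(1440, 128 * 80)
    y = np.random.randint(0, 8, size=1440)

    # 80/10/10 split: hold out 20% first, then split that half-and-half into
    # validation and test sets, keeping the class balance with stratify.
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=42)

    # Fit the Standard Scaler on the training split only, then apply it everywhere.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)
    X_test = scaler.transform(X_test)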
3.2 Pre-Processing

The small size of the dataset makes a denser network containing several convolution blocks overfit the training data and underperform on unseen data. As a result, a data augmentation mechanism is included in our design. It would be challenging to produce more real samples, however: adding significant noise creates learning issues, while adding very small noise components does not make the model more generalized, so a proper amount of noise is needed. AWGN [21] is a channel model which assumes that the only interference with communication is the linear addition of broadband or white noise with a constant spectral density (measured in watts per hertz of bandwidth) and a Gaussian amplitude distribution. The model does not consider fading, frequency selectivity, interference, nonlinearity, or dispersion.

Signals are loaded at a 48 kHz sample rate and cut to between 0.5 and 3 seconds. If a signal is shorter than 3 seconds, it is padded with zeros.

The computed Mel spectrogram is used as the input to the model; for the model, the spectrogram is split into seven chunks.
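A minimal sketch of this loading and augmentation step is given below, assuming librosa is used for loading (as in Section 3.3). The SNR value, the helper names, and the file path are illustrative assumptions rather than values reported in the paper.

    import numpy as np
    import librosa

    SR = 48000          # sample rate used for loading
    MAX_LEN = 3 * SR    # clips are cut or padded to 3 seconds

    def load_fixed_length(path, sr=SR, max_len=MAX_LEN):
        """Load an audio file, keep at most 3 s, and zero-pad shorter clips."""
        y, _ = librosa.load(path, sr=sr, duration=3.0)
        if len(y) < max_len:
            y = np.pad(y, (0, max_len - len(y)))
        return y

    def add_awgn(signal, snr_db=15.0):
        """Add white Gaussian noise at a given signal-to-noise ratio (in dB)."""
        signal_power = np.mean(signal ** 2)
        noise_power = signal_power / (10 ** (snr_db / 10))
        noise = np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)
        return signal + noise

    # Demo on a synthetic tone; in practice each training clip gets a clean and
    # a noise-augmented version, e.g. y = load_fixed_length("path/to/clip.wav").
    t = np.linspace(0.0, 3.0, MAX_LEN, endpoint=False)
    y = np.sin(2 * np.pi * 220.0 * t)
    y_aug = add_awgn(y)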
3.3 Model

i) CNN – LSTM

In this model the dataset is loaded and the Mel spectrogram is calculated using the librosa Python library, with a Hamming window of width 512 and a hop length of 256. The Mel spectrogram is then divided into seven chunks, as mentioned in the preprocessing step, and the chunks are given as input to a 2D CNN [22] with time-distributed layers [23], arranged as a stack of four conv blocks, for better learning. Figure 2 shows the model architecture. The four sequential 2D conv blocks are configured as follows.
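A sketch of this feature-extraction step, using the stated window and hop settings, might look as follows. The number of Mel bands, the log scaling, and the chunking helper are illustrative assumptions, not values reported in the paper.

    import numpy as np
    import librosa

    def mel_chunks(y, sr=48000, n_mels=128, n_chunks=7):
        """Compute a log-Mel spectrogram with a Hamming window (width 512,
        hop 256) and split it into equal chunks along the time axis."""
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=512, win_length=512, hop_length=256,
            window="hamming", n_mels=n_mels)
        log_mel = librosa.power_to_db(mel, ref=np.max)
        # Drop leftover frames so the spectrogram divides evenly into n_chunks.
        frames = (log_mel.shape[1] // n_chunks) * n_chunks
        return np.stack(np.split(log_mel[:, :frames], n_chunks, axis=1))

    # Demo on a 3-second signal: shape is (7, n_mels, frames_per_chunk).
    chunks = mel_chunks(np.random.randn(3 * 48000))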
In the first block, one channel is used as input and 16 channels as output, with padding and stride of one and a kernel size of 3. Max pooling is done with a kernel size of 2 and a stride of 2, and 16 channels/dimensions are passed from the last layer to the norm layer.

In the second block, 16 channels are used as input and 32 channels as output, with padding and stride of one and a kernel size of 3 [24]. Max pooling [25] is done with a kernel size of 4 and a stride of 4, and 32 channels/dimensions are passed from the last layer to the norm layer.

The other two blocks have the same configuration as the second block; only the BatchNorm2d size varies, 64 for the third block and 128 for the fourth, and each block uses the same dropout rate of 0.4. The final block is an LSTM [26] block.

After this configuration, the flattened output of the conv stack is passed to the LSTM, which is combined with a linear SoftMax layer [27, 28] (whose number of inputs must equal the number of hidden nodes in the LSTM layer). Finally, the fully connected layers produce the output for the loss function, and the activation function [29, 30] provides the probability of each emotion (the classification). The next step in coding was training the model just created and then validating and verifying it.
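Based on the block configuration just described, the model could be sketched in PyTorch roughly as follows. The LSTM hidden size, the exact chunk shape, and the ceil-mode pooling (used only so that small feature maps keep at least one output cell) are assumptions for illustration, not the authors' exact implementation.

    import torch
    import torch.nn as nn

    def conv_block(in_ch, out_ch, pool, drop=0.4):
        """Conv -> BatchNorm -> ReLU -> MaxPool -> Dropout, as in Section 3.3."""
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=pool, stride=pool, ceil_mode=True),
            nn.Dropout(drop),
        )

    class CnnLstmSER(nn.Module):
        """Time-distributed CNN over Mel-spectrogram chunks followed by an LSTM."""
        def __init__(self, lstm_input, n_classes=8, hidden=64):
            super().__init__()
            # Four stacked conv blocks: 1->16 (pool 2), then 16->32, 32->64,
            # 64->128, each pooled with kernel/stride 4.
            self.cnn = nn.Sequential(
                conv_block(1, 16, pool=2),
                conv_block(16, 32, pool=4),
                conv_block(32, 64, pool=4),
                conv_block(64, 128, pool=4),
            )
            # lstm_input = 128 * H * W of the pooled feature maps; it can be
            # found once by running a dummy chunk through self.cnn.
            self.lstm = nn.LSTM(input_size=lstm_input, hidden_size=hidden,
                                batch_first=True)
            self.fc = nn.Linear(hidden, n_classes)

        def forward(self, x):
            # x: (batch, chunks, 1, n_mels, frames); the CNN is applied to each
            # chunk ("time-distributed"), then the LSTM runs over the chunks.
            b, t = x.shape[:2]
            feats = self.cnn(x.flatten(0, 1))         # (b*t, 128, H, W)
            feats = feats.flatten(1).view(b, t, -1)   # (b, t, 128*H*W)
            out, _ = self.lstm(feats)
            return self.fc(out[:, -1])                # logits; softmax in the loss

    # Example shapes: 7 chunks of a 128-band spectrogram, 80 frames per chunk:
    # x = torch.randn(4, 7, 1, 128, 80)
    # CnnLstmSER(lstm_input=128)(x).shape  -> torch.Size([4, 8])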
Each convolution block has almost the same configuration as mentioned above. More important is the mapping of layers inside a block: the convolution layer applies a filter to the input spectrogram, finds features, and passes them to the batch normalization layer; the BN layer normalizes the output of the conv layer, and the activation layer then activates the BN output using the ReLU function. Finally, the dimensions are reduced so that only the important features are kept: the max pooling layer reduces the dimensions, and the last layer of the conv block is the dropout layer, which drops neurons and sends the most important remaining features to the LSTM block. The LSTM block has the advantage of keeping the outputs of the individual chunks produced by the CNN blocks; it remembers the correct sequence of the classes and their predictions.

Initially the model performed worse on testing data than it could have: results on the training samples were accurate, while predictions on the testing samples were not. AWGN was therefore added to remove the overfitting, and the addition of AWGN together with the dropout of neurons overcomes this issue.

Figure 2: Flow diagram of the first model.

4. RESULTS

The model produces the results described below. There are 1147 audio clips for training (80% of the dataset). As the dataset is balanced, it contains the same number of clips for each category. The results produced by the model are represented in the following confusion matrix.

Figure 3: Confusion matrix of the model.
As mentioned, the loss before adding AWGN is higher, and the addition of AWGN reduces it. Figures 4 and 5 show the loss results before and after adding AWGN (the images are taken directly from the training software, which is why they contain grids).

Figure 4: Loss before AWGN.

Table 1: Confusion matrix description
6   Angry     100% Angry
7   Fear      84.4% Fear, 10% Sad, 3.5% Angry, 1.5% Surprise
8   Disgust   96.5% Disgust, 3.5% Angry

The classification metrics of the model, Precision, Recall, and F1-Score, together with the overall Accuracy, are obtained from the confusion matrix and given in Table 2.

Table 2: Performance metrics (per-class Precision, Recall, and F1-Score; overall Accuracy: 93.4%)
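For reference, the metrics reported in Tables 1 and 2 can be reproduced from the test-set predictions with scikit-learn, roughly as sketched below. The label arrays and the ordering of the emotion names are illustrative placeholders, not the paper's actual outputs.

    import numpy as np
    from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

    # The eight emotion categories of RAVDESS (order here is an assumption).
    EMOTIONS = ["neutral", "calm", "happy", "sad",
                "angry", "fearful", "disgust", "surprised"]

    # y_true / y_pred stand in for test-set labels and the model's predictions.
    y_true = np.random.randint(0, 8, size=144)
    y_pred = np.random.randint(0, 8, size=144)

    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred, labels=list(range(8)),
                                target_names=EMOTIONS, digits=3))
    print("Overall accuracy:", accuracy_score(y_true, y_pred))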
REFERENCES

[1] R. Anusha, P. Subhashini, D. Jyothi, P. Harshitha, J. Sushma and N. Mukesh, "Speech Emotion Recognition using Machine Learning," 2021 5th International Conference on Trends in Electronics and Informatics (ICOEI), 2021, pp. 1608-1612, doi: 10.1109/ICOEI51242.2021.9453028.
[2] Y. B. Singh and S. Goel, "Survey on Human Emotion Recognition: Speech Database, Features, and Classification," 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), 2018, pp. 298-301, doi: 10.1109/ICACCCN.2018.8748379.
[3] S. E. Bou-Ghazale and J. H. L. Hansen, "A comparative study of traditional and newly proposed features for recognition of speech under stress," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 429-442, July 2000, doi: 10.1109/89.848224.
[4] S. R. Kadiri and P. Alku, "Excitation Features of Speech for Speaker-Specific Emotion Detection," IEEE Access, vol. 8, pp. 60382-60391, 2020, doi: 10.1109/ACCESS.2020.2982954.
[5] Y. Zhan and X. Yuan, "Audio post-processing detection and identification based on audio features," 2017 International Conference on Wavelet Analysis and Pattern Recognition (ICWAPR), 2017, pp. 154-158, doi: 10.1109/ICWAPR.2017.8076681.
[6] L. Roberts, "Understanding the Mel spectrogram," Medium, Mar. 14, 2020. Retrieved July 20, 2022, from https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53
[7] D. Prabakaran and S. Sriuppili, "Speech Processing: MFCC Based Feature Extraction Techniques - An Investigation," Journal of Physics: Conference Series, vol. 1717, 2021.
[8] A. Shah, M. Kattel, A. Nepal and D. Shrestha, "Chroma Feature Extraction," 2019.
[9] M. Gao, J. Dong, D. Zhou, Q. Zhang and D. Yang, "End-to-end speech emotion recognition based on a one-dimensional convolutional neural network," in Proc. ACM ICIAI, 2019, pp. 78-82.
[10] X. Ma, Z. Wu, J. Jia, M. Xu, H. Meng and L. Cai, "Emotion recognition from variable-length speech segments using deep learning on spectrograms," in Proc. INTERSPEECH, 2018, pp. 3683-3687.
[11] M. Neumann and N. T. Vu, "Improving Speech Emotion Recognition with Unsupervised Representation Learning on Unlabeled Speech," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 7390-7394, doi: 10.1109/ICASSP.2019.8682541.
[12] D. Eledath, P. Inbarajan, A. Biradar, S. Mahadeva and V. Ramasubramanian, "End-to-end speech recognition from raw speech: Multi time-frequency resolution CNN architecture for efficient representation learning," 2021 29th European Signal Processing Conference (EUSIPCO), 2021, pp. 536-540, doi: 10.23919/EUSIPCO54536.2021.9616171.
[13] X. Ma, Z. Wu, J. Jia, M. Xu, H. Meng and L. Cai, "Emotion recognition from variable-length speech segments using deep learning on spectrograms," in Proc. INTERSPEECH, 2018.
[14] S. Mirsamadi, E. Barsoum and C. Zhang, "Automatic speech emotion recognition using recurrent neural networks with local attention," IEEE ICASSP, 2017, pp. 2227-2231.
[15] T. L. Nwe, S. W. Foo and L. C. De Silva, "Speech emotion recognition using hidden Markov models," Speech Communication, vol. 41, no. 4, pp. 603-623, 2003.
[16] H. Bourlard, Y. Konig, N. Morgan, et al., "A new training algorithm for hybrid HMM/ANN speech recognition systems," 1996 8th European Signal Processing Conference, Trieste, Italy: IEEE, 1996, pp. 1-4.
[17] I. Kipyatkova, "LSTM-based language models for very large vocabulary continuous Russian speech recognition," in Speech and Computer, Cham: Springer International Publishing, 2019, pp. 219-226.
[18] Y. Zhang, J. Du, Z. R. Wang, et al., "Attention Based Fully Convolutional Network for Speech Emotion Recognition," 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Honolulu, USA: IEEE, 2018, pp. 1771-1775.
[19] A. U A and K. V K, "Speech Emotion Recognition - A Deep Learning Approach," 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics, and Cloud) (I-SMAC), 2021, pp. 867-871, doi: 10.1109/I-SMAC52330.2021.9640995.
[20] S. R. Livingstone and F. A. Russo, "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English," PLOS ONE, vol. 13, no. 5, e0196391, 2018.
[21] S. S. Meher and T. Ananthakrishna, "Dynamic spectral subtraction on AWGN speech," 2015 2nd International Conference on Signal Processing and Integrated Networks (SPIN), 2015, pp. 92-97, doi: 10.1109/SPIN.2015.7095302.
[22] A. Mujaddidurrahman, F. Ernawan, A. Wibowo, E. A. Sarwoko, A. Sugiharto and M. D. R. Wahyudi, "Speech Emotion Recognition Using 2D-CNN with Data Augmentation," 2021 International Conference on Software Engineering & Computer Systems and 4th International Conference on Computational Science and Information Management (ICSECS-ICOCSIM), 2021, pp. 685-689, doi: 10.1109/ICSECS52883.2021.00130.
[23] Y. Eom and J. Bang, "Speech Emotion Recognition Using 2D-CNN with Mel-Frequency Cepstrum Coefficients," Journal of Information and Communication Convergence Engineering, vol. 19, no. 3, pp. 148-154, 2021, doi: 10.6109/JICCE.2021.19.3.148.
[24] A. K. Pandey, "Convolution, padding, stride, and pooling in CNN," Medium, Jan. 24, 2021. Retrieved July 20, 2022.
[25] H. Phan, et al., "Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks," Interspeech 2016, ISCA, 2016, pp. 3653-3657, doi: 10.21437/Interspeech.2016-123.
[26] J. Oruh, S. Viriri and A. Adegun, "Long Short-Term Memory Recurrent Neural Network for Automatic Speech Recognition," IEEE Access, vol. 10, pp. 30069-30079, 2022, doi: 10.1109/ACCESS.2022.3159339.
[27] V. Passricha and R. K. Aggarwal, "A Hybrid of Deep CNN and Bidirectional LSTM for Automatic Speech Recognition," Journal of Intelligent Systems, vol. 29, no. 1, pp. 1261-1274, 2020.
[28] P. James, H. Mun, C. Vaithilingam, A. Tan and W. Chiat, "End to End Speech Recognition using LSTM Networks for Electronic Devices," Journal of Advanced Research in Dynamical and Control Systems, vol. 10, pp. 933-939, 2018.
[29] S.-X. Zhang, R. Zhao, C. Liu, J. Li and Y. Gong, "Recurrent support vector machines for speech recognition," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5885-5889, doi: 10.1109/ICASSP.2016.7472806.
[30] A. Graves, et al., "Speech Recognition with Deep Recurrent Neural Networks," arXiv:1303.5778, Mar. 2013.
[31] Y. Zhang, et al., "Attention Based Fully Convolutional Network for Speech Emotion Recognition," 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), IEEE, 2018, pp. 1771-1775, doi: 10.23919/APSIPA.2018.8659587.