Speech Emotion Recognition A thesis submitted in partial fulfillment of the requirements for the award of the degree of Master of
Science in
Computer Science
by
Yogender Kumar (21419CMP036)
(Cover figures: a block diagram of the speech emotion recognition pipeline, from the speech signal through speech processing, feature extraction, feature selection, and classification to the recognized emotion; and an overview of the experimental workflow, from the audio file database through training and testing data, feature engineering, feature subset selection, and model classification to the evaluation of results.)
Department of Computer Science Institute of Science Banaras Hindu University, Varanasi – 221005 July 2023
CANDIDATE’S DECLARATION
I, Yogender Kumar, hereby certify that the work, which is being presented in the thesis/report,
entitled
Speech Emotion Recognition,
in partial fulfillment of the requirement for the award of the Degree of Master of Science in
Computer Science and submitted to the institution is an authentic record of my/our own work carried
out
during the period March 2023 to July 2023 under the supervision of Dr. Vandana Kushwaha. I have also cited the references for the text(s)/figure(s)/table(s)/equation(s) from where they have been taken.
The matter presented in this thesis has not been submitted elsewhere for the award of any other degree or diploma from any
Institutions. Date: Signature of the Candidate This is to certify that the above statement made by the candidate is correct to the best
of my/our knowledge.
The Viva-Voce examination of Yogender Kumar, M.Sc. Student has been held on _________________.
Signature of Signature of Research Supervisor Head of the Department
ABSTRACT
Human-Computer Interaction (HCI) includes vital but difficult elements like emotion recognition from speech signals. Numerous
techniques, including various established methods for speech analysis and classification, have been employed in the field of speech
emotion recognition (SER) to extract emotional information from signals. SER is a field of study that
focuses on developing techniques and algorithms to automatically recognize and interpret emotions conveyed through speech
signals. Emotions play a crucial role in human communication, and accurately identifying them has various applications in areas such
as human-computer interaction, virtual agents, and mental health monitoring. This report aims to explore the challenges and
advancements in SER by reviewing relevant literature and implementing a practical framework for emotion recognition from speech
data. The study begins by reviewing the fundamental concepts of emotion recognition and the role of speech as a medium for
emotional expression. Various signal processing techniques, such as feature extraction and dimensionality reduction, are examined in
the context of SER. Various deep learning models are investigated for their effectiveness in recognizing emotions from speech. The
results obtained from the implementation are analyzed and compared with existing approaches to evaluate the performance and
effectiveness of the proposed framework. The findings of this research contribute to the broader field of SER and provide insights into
improving the accuracy and efficiency of emotion recognition systems for real-world applications.
Keywords: Speech Emotion Recognition, Speech Processing, Deep Learning, Feature Extraction, Sentiment, Human Interaction.
TABLE OF CONTENTS Title Page No. ABSTRACT v LIST OF TABLES vi LIST OF FIGURES vii LIST OF
ABBREVIATIONS viii
CHAPTER 1 INTRODUCTION 1.1 General 1 1.2 Problem Statement 2 1.3 Objectives 3 1.4 Scope of the report 5
CHAPTER 2 LITERATURE REVIEW 2.1 Overview of Speech Emotion Recognition 6 2.2 Historical Development 7 2.3 Theoretical
Framework and Models 8 2.4 Feature Extraction Techniques 10 2.5 Emotion Dataset 11 2.6 Related Works 13
CHAPTER 3 METHODOLOGY 3.1 Introduction 18 3.2 Methodology 19 3.3 Classification Algorithm or Model Selection 22 3.4 Proposed
System Architecture 23
CHAPTER 4 IMPLEMENTATION 4.1 Introduction 26 4.2 Dataset Description 27 4.3 Software and Hardware Requirement 29 4.4 Data
Preprocessing and Feature Extraction & Selection 31 4.5 Speech Emotion Recognition Techniques 34
CHAPTER 5 RESULTS AND DISCUSSION 5.1 Introduction 35 5.2 Evaluation Metrics 36 5.3 Results 37 5.4 Comparative Analysis of
Different Techniques 39 5.5 Limitations and Challenges 40
CHAPTER 6 CONCLUSION AND FUTURE WORK 6.1 Conclusion 42 6.2 Scope for Future Work 43
REFERENCES 44
PLAGIARISM REPORT 49
LIST OF TABLES
Table No. Title Page No.
1. Dataset Feature Emotion used in SER 12
2. Dataset information Comparison table 27
3. Packages used in Speech Emotion Recognition System 30
4. Libraries used in Speech Emotion Recognition System 30
5. List of features present in an audio signal 33
6. Confusion Matrix 36
7. Evaluation Metrics Comparison for Speech Emotion Recognition Models 39
LIST OF FIGURES
Figure No. Title Page No.
1. Traditional Speech Emotion Recognition System 2
2. Speech emotion recognition system block diagram 9
3. Graph showing Training and Validation accuracy 19
4. Process in speech emotion recognition system 20
5. Methodology of Speech emotion Recognition 21
6. CNN Algorithm 22
7. Speech Feature Classification 25
8. Hierarchy of a speech emotion recognition system 32
9. Graph Showing Comparison of evaluation Metrics 40
LIST OF ABBREVIATIONS
Abbreviation Description
CNN Convolutional Neural Network
LSTM Long Short-Term Memory
RNN Recurrent Neural Network
SAVEE Surrey Audio-Visual Expressed Emotion
TESS Toronto Emotional Speech Set
RAVDESS Ryerson Audio-Visual Database of Emotional Speech and Song
IEEE Institute of Electrical and Electronics Engineers
ACM Association for Computing Machinery
TDES Triple Data Encryption Standard
MFCC Mel-Frequency Cepstral Coefficients
HMM Hidden Markov Model
SVM Support Vector Machine
DCT Discrete Cosine Transform
SSD Solid-State Drive
PCA Principal Component Analysis
HDD Hard Disk Drive
RAM Random Access Memory
VRAM Video Random Access Memory
GPU Graphics Processing Unit
AMD Advanced Micro Devices
CHAPTER 1 INTRODUCTION
1.1 GENERAL
Speech emotion recognition using deep learning is an emerging field that leverages advanced neural network architectures, such as
Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) [1], and Long Short-Term Memory (LSTM)
networks [2], to automatically detect and classify emotions expressed in speech signals. The ability to accurately recognize and
interpret emotions in speech has significant implications for various applications, including human-computer interaction, affective
computing, and psychological research [1].
Traditional methods of speech emotion recognition often relied on handcrafted features and shallow learning algorithms, which
limited their ability to capture the complex patterns inherent in speech data. Deep learning techniques, with their ability to
automatically learn hierarchical representations from raw data, have revolutionized the field by enabling the extraction of highly
discriminative features directly from speech signals [1].
This report aims to explore the application of deep learning, specifically CNN [1], RNN, and LSTM networks [2], for speech emotion
recognition. By utilizing these architectures, we aim to develop a robust and efficient system capable of accurately identifying and
classifying different emotional states conveyed in speech. In this report, we will delve into the theoretical foundations of deep learning
and its relevance to speech emotion recognition. We will review existing literature to gain insights into the historical development,
theoretical frameworks, and models utilized in this field [2]. Additionally, we will explore feature extraction techniques specifically
tailored for speech emotion recognition. The proposed approach section will outline our methodology, including the implementation
of CNN, RNN, and LSTM networks for speech emotion recognition. We will discuss the feature extraction and selection process, the
choice of classification algorithms or models, and the proposed system architecture that integrates these components.
Furthermore, we will provide a comprehensive overview of the implementation details, including dataset description, software, and
hardware requirements. We will explain the preprocessing steps and feature extraction techniques employed to transform raw speech
signals into suitable inputs for the deep learning models. The results and discussion section will present the evaluation metrics and
results obtained from our work. We will conduct a comparative analysis of different deep learning techniques, including CNN [1], RNN,
and LSTM [2], and discuss their performance in speech emotion recognition tasks. Additionally, we will address the limitations and
challenges associated with our approach and provide a detailed discussion of the findings. Finally, in the conclusion and future work
section, we will summarize the key findings of our study, highlight the contributions of our research, and discuss potential avenues for
future work in improving speech emotion recognition using deep learning.
Figure 1: Traditional Speech Emotion Recognition System
1.2 PROBLEM STATEMENT Speech emotion recognition is a challenging task due to the complex nature of emotions and the inherent
variability in speech signals. Traditional approaches to emotion recognition often relied on handcrafted features, which required
domain expertise and may not fully capture the intricate patterns within the data [3]. Moreover, such shallow learning approaches may struggle to learn high-level representations that can effectively discriminate between different emotional states.
The specific challenges we seek to address include:
1.2.1 Variability in Speech Data:
Speech signals exhibit significant variability in terms of acoustic characteristics, speaking styles, and individual differences. Recognizing
emotions from speech requires capturing and understanding these subtle variations, which poses a challenge for traditional methods
[3].
1.2.2 Complex and Dynamic Emotional States:
Emotions are complex, multidimensional states that can evolve dynamically within a speech segment. Deep learning techniques, with
their ability to model temporal dependencies and capture hierarchical representations, have the potential to effectively capture the
dynamic nature of emotional expressions [1,3].
1.2.3 Limited Labeled Training Data:
Acquiring large-scale labeled datasets for speech emotion recognition is a challenging and time-consuming process. Limited
availability of labeled data can hinder the training and generalization capabilities of deep learning models. We aim to explore strategies
to address this issue, such as data augmentation and transfer learning [3].
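As an illustrative sketch only (not the exact procedure followed in this work), simple waveform-level augmentations such as additive noise, pitch shifting, and time stretching can be generated with Librosa and NumPy; the specific parameter values below are arbitrary assumptions:

```python
import numpy as np
import librosa

def augment_waveform(y, sr):
    """Return simple augmented variants of a waveform (illustrative only)."""
    augmented = {}
    # Additive Gaussian noise: simulates recording-condition variability.
    noise = 0.005 * np.random.normal(size=y.shape)
    augmented["noisy"] = y + noise
    # Pitch shift by two semitones: varies speaker pitch without changing tempo.
    augmented["pitch_shifted"] = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    # Time stretch: speeds the utterance up slightly while keeping pitch.
    augmented["stretched"] = librosa.effects.time_stretch(y, rate=1.1)
    return augmented

# Example usage (path is hypothetical):
# y, sr = librosa.load("audio/example.wav", sr=None)
# variants = augment_waveform(y, sr)
```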
1.3 OBJECTIVES
The primary objective of this project/report is to explore the application of deep learning techniques, specifically
Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) [5], and Long Short-Term Memory (LSTM)
networks. The report will cover various aspects related to the development and evaluation of deep learning models for accurately
detecting and classifying emotions expressed in speech signals.
The report will primarily explore the theoretical foundations of deep learning and its relevance to speech emotion recognition. It will
provide a comprehensive review of existing literature, encompassing the historical development, theoretical frameworks, models,
feature extraction techniques, emotion datasets, and existing speech emotion recognition systems. This review will serve as a
foundation for understanding the current state-of-the-art and identifying research gaps and challenges.
In terms of the proposed approach, the report will detail the methodology for speech emotion recognition, with a specific focus on
the integration of CNNs, RNN and LSTM networks. It will cover the preprocessing steps, feature extraction techniques, and the design
and training of deep learning models for accurate emotion classification. The implementation section will provide practical details
regarding dataset selection and description, software and hardware requirements, preprocessing techniques, and the training process
of the CNN, RNN and LSTM models [1,2,3].
The evaluation and comparative analysis section will assess the performance of the deep learning models for speech emotion
recognition. It will include appropriate evaluation metrics, such as accuracy, precision, recall, and F1-score, to measure the effectiveness of the models. The comparative analysis will compare the performance of different deep learning techniques, specifically CNN, RNN, and LSTM networks, providing insights into their strengths and limitations [6].
The report will also discuss the limitations and challenges associated with speech emotion recognition using deep learning. It will
address issues such as limited training data, overfitting, generalization capabilities, and potential biases. Additionally, the report will
provide recommendations and potential directions for future research and improvement in the field [7].
CHAPTER 2 LITERATURE REVIEW
2.1 OVERVIEW OF SPEECH EMOTION RECOGNITION Speech emotion recognition is a multidisciplinary field that focuses on the
automatic detection and classification of emotions expressed in speech signals. It aims to develop computational models and systems
that can recognize and interpret emotional states conveyed through spoken language. The ability to accurately identify emotions from
speech has important applications in various domains, including human-computer interaction, affective computing, and psychological
research.
In this section, we provide an overview of speech emotion recognition, highlighting its significance and the challenges involved. We
explore the characteristics and dynamics of emotions as expressed through speech, as well as the potential cues and features that can
be extracted to capture emotional information. One of the primary challenges in speech emotion recognition lies in the variability and
complexity of emotional expressions [5]. Emotions can manifest in a wide range of vocal cues, including changes in pitch, intensity,
rhythm, and spectral content. Additionally, contextual factors and individual differences further contribute to the complexity of the
problem.
To address these challenges, researchers have explored various approaches and techniques for speech emotion recognition. These
range from traditional machine learning methods to more recent advancements in deep learning. Deep learning techniques, such as
Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM)
networks, have shown promising results in capturing complex patterns and temporal dependencies present in speech data [1,2,8].
Furthermore, speech emotion recognition is closely related to other disciplines, including signal processing, linguistics, psychology,
and affective computing. Integrating knowledge and techniques from these fields enables a more holistic understanding of emotional
communication through speech and aids in the development of more robust recognition systems. In this report, we aim to explore
and contribute to the field of speech emotion recognition using deep learning techniques, specifically CNNs, RNN and LSTM
networks. By leveraging the power of deep learning, we seek to develop a system that can effectively analyze and classify emotional
states expressed in speech signals. Through our research, we aim to advance the current state-of-the-art in speech emotion
recognition and contribute to the development of more accurate and reliable systems [1,2].
2.2 HISTORICAL DEVELOPMENT
The field of speech emotion recognition has witnessed significant advancements over the years, driven by technological
advancements and an increasing interest in understanding and analyzing human emotions in speech. This section provides a historical
overview of the development of speech emotion recognition, highlighting key milestones and influential studies. The exploration of
emotions in speech dates back to early research in the fields of psychology and linguistics, where scholars recognized the importance
of vocal cues and prosody in conveying emotional information. Early studies focused on manual annotation and analysis of speech
recordings to identify emotional states based on subjective judgments [6]. With the emergence of computer-based analysis,
researchers began to explore automated approaches for speech emotion recognition. Early methods often relied on handcrafted
features derived from acoustic properties, such as pitch, intensity, and spectral characteristics. These features were then used in
conjunction with classical machine learning algorithms, such as Support Vector Machines (SVM) and Hidden Markov Models (HMM), to
classify emotions [9].
As technology progressed, researchers started incorporating more advanced signal processing techniques to capture and extract
relevant features from speech signals. This led to the development of feature extraction methods such as Mel-frequency Cepstral
Coefficients (MFCC), Perceptual Linear Prediction (PLP), and various prosodic features.
In recent years, deep learning techniques, particularly Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and
Long Short-Term Memory (LSTM)
networks [1,2], have revolutionized the field of speech emotion recognition. These deep learning models have demonstrated superior
performance in capturing complex patterns and temporal dependencies within speech signals, enabling more accurate and robust
emotion classification. The availability of large-scale emotion datasets, such as
the Berlin Emotional Speech Database (EmoDB) [10] and the Interactive Emotional Dyadic Motion Capture (
IEMOCAP) dataset [10], has played a significant role in advancing the field. These datasets provide valuable resources for training and
evaluating speech emotion recognition models, enabling researchers to develop more data-driven and reliable systems [10].
In summary, the historical development of speech emotion recognition has seen a progression from manual annotation and
subjective judgments to automated approaches using handcrafted features and classical machine learning algorithms. With
advancements in signal processing and the introduction of deep learning techniques, the field has witnessed significant improvements
in accuracy and performance.
2.3 THEORETICAL FRAMEWORK AND MODELS Speech emotion recognition relies on a solid theoretical framework and models to
capture and interpret emotional information present in speech signals. This section explores the theoretical underpinnings and models
that form the basis for understanding and analyzing emotions in speech. Emotion theories provide the foundation for understanding
how emotions are expressed and perceived in speech. One prominent theoretical framework is the basic emotion theory, which
suggests that emotions can be categorized into a set of basic or primary emotions, such as happiness, sadness, anger, fear, and disgust
[9]. This framework serves as a starting point for the development of emotion recognition models that aim to classify speech signals
into these basic emotion categories.
In addition to the basic emotion theory, dimensional models of emotions have also been proposed. These models consider emotions
as continuous dimensions defined by valence (positive/negative) and arousal (activation level). They provide a more nuanced
representation of emotional states, allowing for a finer-grained classification and analysis of speech signals. The circumplex model, in
particular, represents emotions as points within a two-dimensional space, with valence and arousal as the axes. A block diagram of a typical SER system is shown in Figure 2.
Figure 2: Speech emotion recognition system block diagram.[11]
To translate these theoretical frameworks into practical models for speech emotion recognition, various machine learning and deep
learning approaches have been employed. Classical machine learning algorithms, such as Support Vector Machines (SVM), Hidden
Markov Models (HMM), and Gaussian Mixture Models (GMM), have been widely used in the past [11]. These models often require
handcrafted features and rely on statistical modeling techniques to classify emotions. More recently, deep learning models have
shown remarkable success in capturing complex patterns and temporal dependencies in speech data. Convolutional Neural Networks
(CNNs) [2] have proven effective in extracting hierarchical and spatial representations from spectrograms or other time-frequency
representations of speech signals. Long Short-Term Memory (LSTM) networks, on the other hand, excel at modeling sequential
dependencies and capturing long-term temporal dynamics in speech. The combination of CNNs and LSTMs, such as the
Convolutional Recurrent Neural Network (CRNN) [9], has gained significant attention in speech emotion recognition. These hybrid
models leverage the strengths of both CNNs and LSTMs to capture both local and global contextual information in speech signals [11].
2.4 FEATURE EXTRACTION TECHNIQUES
Feature extraction plays a crucial role in speech emotion recognition, as it aims to capture the relevant information from speech
signals that can discriminate between different emotional states. This section explores various feature extraction techniques employed
in the field and their significance in enhancing the performance of emotion recognition systems.
Traditionally, handcrafted features based on acoustic properties, prosody, and spectral characteristics have been widely used for
speech emotion recognition. These features include fundamental frequency (F0), energy, Mel-frequency cepstral coefficients (MFCCs)
[12], and their derivatives. These features can capture information related to pitch, loudness, spectral shape, and temporal dynamics,
which are important cues for conveying emotions in speech.
In addition to traditional handcrafted features, recent advancements in deep learning have led to the exploration of learned features
directly from raw speech signals. Deep learning-based feature extraction methods, such as Deep Belief Networks (DBNs) [13],
Autoencoders, and Convolutional Neural Networks (CNNs), have shown promising results in capturing more discriminative and
abstract representations from speech data. CNNs, originally developed for image analysis, have been adapted to process speech
signals by considering spectrograms or other time-frequency representations as two-dimensional images. The convolutional layers of
CNNs can learn hierarchical and local representations, capturing both low-level and high-level acoustic features [14]. These learned
features have demonstrated improved performance in discriminating different emotional states. Another approach for feature
extraction is based on recurrent neural networks (RNNs) [4], particularly Long Short-Term Memory (LSTM) networks [2]. LSTMs are
capable of capturing long-term temporal dependencies and modeling sequential patterns present in speech signals. By utilizing
LSTMs, emotional information can be effectively encoded into the learned features, leading to enhanced emotion recognition
performance.
Furthermore, deep learning-based feature extraction methods have the potential to automatically learn relevant representations from
speech data without relying on explicit handcrafted features. This data-driven approach has the advantage of capturing subtle and
complex patterns in speech that may be difficult to extract using traditional feature extraction techniques. In this report, we will
explore both traditional handcrafted features and deep learning-based feature extraction techniques for speech emotion recognition
[13]. We aim to compare the performance of these different feature extraction methods, specifically examining the effectiveness of
learned features from deep learning models, such as CNNs and LSTMs, in capturing emotional information.
2.5 EMOTION DATASET AND CORPORA
Accurate training and evaluation of speech emotion recognition systems heavily rely on the availability of suitable datasets and
corpora. This section focuses on the importance of emotion datasets and corpora in advancing the field of speech emotion
recognition and explores some prominent examples used in research.
Emotion datasets and corpora provide researchers with labeled speech samples representing various emotional states. These
resources are essential for training and testing machine learning models and evaluating their performance in recognizing and
classifying emotions in speech. The design and composition of emotion datasets vary depending on the specific goals of the research.
Some datasets focus on a limited number of acted, basic emotions, while others cover a broader range of emotional states; in either case, carefully annotated corpora are needed to enhance the accuracy and effectiveness of speech emotion recognition systems [1].
2.6 RELATED WORKS
In a research paper published in 2019, the researchers introduced a novel model called the Deep Stride CNN architecture, abbreviated as DSCNN. The model
was applied and tested on two datasets, namely IEMOCAP and RAVDESS. The study highlighted the use of spectrograms generated
from enhanced speech signals in the proposed model, leading to improved accuracy and reduced computational complexity.
Specifically, the accuracy of the IEMOCAP dataset increased to 81.75%, while the accuracy of the RAVDESS dataset increased to 79.5%.
Additionally, the utilization of DSCNN allowed for dataset size reduction.
The findings of the study underscored the efficacy of the Deep Stride CNN architecture in enhancing audio signal processing for
speech emotion recognition. By leveraging spectrograms derived from enhanced speech signals, the proposed model achieved
higher accuracy rates while mitigating computational complexity, thereby contributing to more efficient and accurate speech emotion
recognition systems [2].
In the research article titled "Emotion Recognition of EEG Signals Based on the Ensemble Learning Method: AdaBoost" published in
2021, the authors propose a method for emotion recognition using EEG (electroencephalogram) signals. The proposed approach
utilizes the ensemble learning method called AdaBoost. The study incorporates different domains, such as time and time-frequency,
to capture diverse aspects of the EEG signals. Non-linear features related to emotions are extracted from pre-processed EEG signals.
These extracted features are then fused in an eigenvector matrix. To reduce the dimensionality of the features, a linear discriminant
analysis feature selection method is employed. The proposed method is evaluated on the DEAP dataset, and the results demonstrate
its effectiveness in recognizing emotions. The method achieves a remarkable average accuracy rate of up to 98.70%.
This research contributes to the field of emotion recognition by leveraging EEG signals and applying the ensemble learning method of
AdaBoost. By considering multiple domains and extracting non-linear features, the proposed method offers a robust approach to
accurately identify emotions from EEG signals. The utilization of feature fusion and dimensionality reduction techniques further
enhances the performance of the system [3].
In the research paper titled "Speech Emotion Recognition based on SVM and ANN" published in 2018, the authors investigate the use
of SVM (Support Vector Machine) and ANN (Artificial Neural Network) models for speech emotion recognition. The paper focuses on
two types of features: acoustic features and statistical features. These features are calculated using an emotional model constructed
from SVM and ANN. The authors utilize the CASIA Chinese emotional corpus to analyze the key technologies of speech and assess the
impact of feature reduction on speech emotion recognition. To reduce the dimensionality of the features, the authors employ
techniques such as Principal Component Analysis (PCA). By applying PCA, the dimension of the feature space is reduced, leading to a
more compact representation of the data.
The study compares the performance of SVM and ANN models with and without PCA. The results indicate that the SVM model
achieves higher accuracy, with accuracy rates of 46.67% and 76.67% when PCA is applied. In contrast, the ANN-based model
demonstrates relatively lower accuracy. Furthermore, improvements are observed when conducting feature dimension reduction,
indicating the effectiveness of reducing feature space dimensionality for enhancing speech emotion recognition [4].
The research paper titled "Speech Emotion Recognition using Fourier Parameters" published in 2015 focuses on utilizing Fourier
parameters for speech emotion recognition. The authors conducted experiments using datasets such as CASIA, EMO-DB, and EESDB.
In this study, the Fourier features were extracted as continuous parameters from the speech signals. The authors employed a machine
learning technique, specifically an SVM classifier with a Gaussian radial basis function kernel, to classify and recognize emotions based
on these Fourier features.
The proposed Fourier parameter (FP) features demonstrated significant improvements in speaker-independent emotion recognition.
The results showed an increase of 16.2 points on the EMO-DB dataset, 6.8 points on the CASIA dataset, and 16.6 points on the EESDB
database. To further enhance the performance, the combination of FP features and MFCC (Mel-frequency cepstral coefficients)
features yielded additional improvements. By incorporating both FP and MFCC features, the system achieved state-of-the-art
performance with approximately 17.5 points, 10 points, and 10.5 points improvement on the aforementioned databases, respectively.
This research highlights the effectiveness of Fourier parameters for speech emotion recognition. The proposed approach utilizing SVM
with Gaussian radial basis function kernel and the combination of FP and MFCC features showcases the potential for achieving high
accuracy in recognizing emotions from speech signals [5].
The paper explores the use of Mel-frequency cepstral coefficients (MFCC) for speech-based human emotion recognition. The authors
highlight that facial expression-based emotion classification contributes to improved fluency, accuracy, and genuineness in human-
computer interaction. This classification approach proves valuable in interpreting and enhancing the interaction between humans and
computers.
The paper emphasizes the strong relationship between feature modeling and classification accuracy. The advantages of the proposed
method include the detection of faces based on the relationships between neighboring regions. For video-based recognition, both
spatial and temporal features are combined into a unified model, providing a comprehensive approach.
However, it is important to note that the paper mentions a disadvantage associated with the method. Some features may be selected
multiple times, leading to redundancy, while others may have negligible impact on the overall classification performance. Overall, the
research presented in this paper focuses on utilizing MFCC for speech-based human emotion recognition. It highlights the advantages
of facial expression-based classification and the importance of feature modeling for achieving accurate results. Additionally, the
authors acknowledge a potential disadvantage related to feature selection and redundancy [6].
CHAPTER 3 METHODOLOGY
3.1 INTRODUCTION In this section, we present our proposed approach for speech emotion recognition using deep learning
techniques, specifically
Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM)
networks. Our objective is to leverage the capabilities of these models to automatically extract meaningful features and capture
temporal dependencies from raw speech data, enabling accurate and robust emotion classification [1,2,4].
The proposed approach consists of several key steps.
Data Preprocessing: The raw speech signals are preprocessed to enhance their quality and remove any noise or artifacts that may interfere with the emotion recognition process. Common preprocessing techniques include noise reduction, normalization, and segmentation.
Feature Extraction: Relevant features are extracted from the preprocessed speech signals to represent the emotional content. Spectrogram-based features, capturing spectral information over time, are utilized along with temporal features derived from the waveform. This combination allows for a comprehensive representation of the emotional characteristics in speech.
Model Architecture: A hybrid architecture is designed that combines CNNs and LSTMs to learn discriminative representations and capture temporal dependencies. The CNN layers extract high-level spectral features from the spectrogram, while the LSTM layers capture the sequential patterns and long-term dependencies within the temporal features.
Training and Optimization: The proposed model is trained using suitable optimization algorithms such as stochastic gradient descent (SGD) or Adam, minimizing the classification error. Regularization techniques like dropout and batch normalization are employed to prevent overfitting and improve generalization.
Model Evaluation: The trained model is evaluated on a separate validation set to assess its performance in recognizing emotions from speech.
3.2 METHODOLOGY
The methodology is built around Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) networks, all of which are powerful deep learning architectures commonly used for sequence-based tasks [17].
The CNN layers in our model will capture spatial dependencies in the speech features, enabling the network to learn important patterns and representations. On the other hand, the LSTM layers will focus on capturing the temporal dynamics and long-term dependencies present in the speech data. By combining these architectures, the model can effectively learn discriminative representations and sequential patterns associated with different emotions.
Evaluation of Results: In the final step, we evaluate the performance of the trained models using the testing data. This evaluation
involves applying the trained models to the testing dataset and measuring their performance
using appropriate evaluation metrics, such as accuracy, precision, recall, or F1-score. The
evaluation results provide insights into the effectiveness of the selected features and the classification models in recognizing speech
emotions [19].
Figure 5: Methodology of Speech emotion Recognition
3.3 CLASSIFICATION ALGORITHM OR MODEL SELECTION
In our proposed approach for speech emotion recognition using deep learning, the selection of an appropriate classification algorithm
or model is crucial to achieve accurate and reliable results. After careful consideration, we have chosen to utilize
Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) [4], and Long Short-Term Memory (LSTM)
networks as the primary models. These models have proven to be highly effective in capturing relevant features and temporal
dependencies within speech signals.
Convolutional Neural Networks (CNNs):
CNNs are deep learning architectures that excel at processing structured grid-like data, such as images and, in our case, spectrograms
of speech signals. CNNs consist of convolutional layers that apply filters to extract spatial features from the input data. These filters,
through the convolution operation, capture patterns and local dependencies, enabling the network to learn meaningful
representations. In the context of speech emotion recognition [7], CNNs can effectively capture acoustic features from spectrograms,
such as frequency distributions and temporal patterns, which are important indicators of different emotional states. The hierarchical
nature of CNNs allows them to learn increasingly complex features, making them well-suited for understanding and discriminating
between various emotional nuances in speech [1], refer to Figure 6.
Figure 6: CNN Algorithm
Recurrent Neural Network (RNN):
Recurrent Neural Networks (RNNs) have been widely used in the context of Speech Emotion Recognition (SER) using deep learning
techniques. RNNs are particularly suitable for modeling sequential data, making them effective in capturing temporal dependencies
and patterns in speech signals [2].
Long Short-Term Memory (LSTM) Networks:
LSTM networks are a type of recurrent neural network (RNN) that have been designed to model sequential data with long-term
dependencies. Unlike traditional RNNs, LSTM networks have memory cells that can selectively retain or forget information over time,
allowing them to capture and retain relevant contextual information from past inputs. In the case of speech emotion recognition,
LSTM networks are particularly effective at capturing the temporal dynamics and long-term dependencies present in speech signals.
They can model the variations and changes in emotional expressions over time, helping to uncover meaningful patterns and dynamics
related to different emotions. The memory cells of LSTM networks enable them to retain relevant information across longer time
intervals, making them well-suited for capturing the temporal evolution of emotions in speech [4,18].
By combining the spatial feature extraction capabilities of CNNs with the temporal modeling abilities of LSTM networks, our proposed
approach can effectively capture both local acoustic characteristics and long-term temporal dependencies within speech signals. This
comprehensive understanding of the emotional content allows for more accurate and robust classification of emotions.
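The following is a minimal, illustrative Keras sketch of one possible CNN + LSTM hybrid operating on framewise MFCC features. The layer sizes, input shape (time steps by MFCC coefficients), and number of emotion classes are assumptions made for demonstration only and do not represent the exact architecture trained in this work:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(num_emotions=7, time_steps=200, n_mfcc=40):
    """A minimal CNN + LSTM hybrid over framewise MFCC features (illustrative sketch)."""
    model = models.Sequential([
        layers.Input(shape=(time_steps, n_mfcc)),
        # 1D convolutions extract local spectral patterns from the MFCC frames.
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.3),
        # The LSTM models temporal dependencies across the pooled frame sequence.
        layers.LSTM(128),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        # Softmax over the emotion classes.
        layers.Dense(num_emotions, activation="softmax"),
    ])
    return model

model = build_cnn_lstm()
model.summary()
```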
3.4 PROPOSED SYSTEM ARCHITECTURE
In this section, we present the proposed system architecture for speech emotion recognition using deep learning. The architecture
outlines the overall framework and integration of the selected models, feature extraction techniques, and classification algorithms.
The goal of this architecture is to effectively capture and classify emotional information from speech signals.
The proposed system architecture consists of the following key components: Input Data: The input data for the system consists of
speech signals, which can be obtained from audio recordings or preprocessed representations such as spectrograms or Mel-
frequency cepstral coefficients (MFCCs) [12]. These speech signals serve as the primary input to the system and contain the emotional
content that needs to be recognized.
Feature Extraction: In this component, various feature extraction techniques are applied to extract relevant acoustic and linguistic
features from the input speech signals. These features capture important characteristics such as pitch, energy, spectral information,
prosody, and linguistic cues that contribute to the expression of emotions in speech. Common feature extraction techniques used in
speech emotion recognition include filter banks, Fourier transforms, wavelet transforms, and other signal processing methods, refer to
Figure 7.
Deep Learning Models: Deep learning models, specifically Convolutional Neural Networks (CNNs), Recurrent Neural Networks
(RNNs) and Long Short-Term Memory (LSTM)
networks, are employed for the classification of speech emotions. CNNs are effective in capturing spatial dependencies and local
patterns in the extracted features [15], while LSTM networks excel at modeling temporal dependencies and capturing long-term
sequential patterns. The combination of CNNs and LSTMs allows for a comprehensive understanding of the emotional content
present in speech signals.
Training and Model Optimization: The deep learning models are trained using labeled data, where the extracted features are paired
with their corresponding emotional labels. During the training phase, the models learn to map the input features to their respective
emotional categories. Model optimization techniques, such as gradient descent-based optimization algorithms, are employed to
adjust the model parameters [19] and minimize the classification error. This iterative training process ensures that the models improve
their performance and accurately classify emotions in speech.
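As a hedged illustration of this training step, the model sketched in Section 3.3 could be compiled with the Adam optimizer and categorical cross-entropy and fitted with a held-out validation split. The placeholder arrays and hyperparameter values below are assumptions, not the settings used in our experiments:

```python
import numpy as np
import tensorflow as tf

# Placeholder arrays standing in for extracted features and one-hot emotion labels.
X_train = np.random.rand(320, 200, 40).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 7, size=320), num_classes=7)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # gradient-based optimization
    loss="categorical_crossentropy",                         # suits one-hot emotion labels
    metrics=["accuracy"],
)

history = model.fit(
    X_train, y_train,
    validation_split=0.2,   # hold out part of the data to monitor generalization
    epochs=50,
    batch_size=32,
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)],
)
```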
Emotion Classification: The final component of the system architecture involves the classification of emotions based on the trained
deep learning models. Given a new speech sample, the extracted features are fed into the models, which output the predicted
emotional category for that sample. The classification results provide information about the recognized emotions present in the input
speech signal. By integrating these components, the proposed system architecture enables the recognition of speech emotions using
deep learning techniques. It captures relevant features, leverages the power of deep learning models, and provides accurate emotion
classification for a given speech input [20].
Figure 7: Speech Feature Classification
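A minimal sketch of the inference step described above is shown below; the emotion label order is hypothetical and must match whatever label encoding was used during training:

```python
import numpy as np

# Hypothetical label order; the actual mapping depends on how labels were encoded.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

def classify_emotion(model, features):
    """Predict the emotion label for a single (time_steps, n_mfcc) feature matrix."""
    probs = model.predict(features[np.newaxis, ...], verbose=0)[0]  # add batch dimension
    return EMOTIONS[int(np.argmax(probs))], float(np.max(probs))

# Example: label, confidence = classify_emotion(model, mfcc_matrix)
```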
CHAPTER 4 IMPLEMENTATION
4.1 INTRODUCTION
In this chapter, we present the implementation details of our speech emotion recognition system using deep learning. This chapter
aims to provide an overview of the implementation process, including the dataset used, evaluation metrics, and the overall
experimental setup. We will discuss the steps taken to preprocess the data, extract features, train the deep learning models, and
evaluate the performance of the system. Implementing a speech emotion recognition system involves various considerations, such as
the availability of a suitable dataset, selection of appropriate evaluation metrics, and careful implementation of the preprocessing and
feature extraction techniques. These factors play a vital role in the accuracy and effectiveness of the system.
In this section, we will outline the key aspects of the implementation, including:
Dataset: The choice of an appropriate dataset is crucial for training and evaluating the performance of the speech emotion
recognition system. We will provide details about the dataset used in our implementation, such as the size, composition, and
annotation of emotional labels. The dataset serves as the foundation for training the deep learning models [16] and evaluating their
performance.
Evaluation Metrics: To assess the performance of the implemented system, it is important to define suitable evaluation metrics. We will
discuss the evaluation metrics employed to measure the accuracy, precision, recall, and F1-score of the system. These metrics allow
us to quantitatively analyze the performance of the system and compare it with existing approaches.
Experimental Setup: We will describe the hardware and software setup used for implementing the speech emotion recognition
system. This includes the specifications of the computing resources, such as the CPU, GPU, and memory, as well as the software
libraries [20] and frameworks utilized for deep learning model development and training.
Preprocessing and Feature Extraction: The preprocessing steps applied to the input speech signals are crucial for enhancing the
quality and removing any noise or artifacts that may affect the emotion recognition process. We will outline the preprocessing
techniques used, such as signal normalization, noise removal, and signal segmentation. Additionally, we will detail the feature
extraction methods employed to extract relevant acoustic and linguistic features from the preprocessed speech signals.
4.2 DATASET DESCRIPTION
The selection of an appropriate dataset is crucial for training and evaluating a speech emotion recognition system. In this section, we
provide a detailed description of the datasets used in our implementation, including their composition, annotation, and relevance to
the task of speech emotion recognition. For our research, we utilized three widely used datasets in the field of speech emotion
recognition: TESS (
Toronto Emotional Speech Set), SAVEE (Surrey Audio-Visual Expressed Emotion), and RAVDESS (Ryerson Audio-Visual Database of
Emotional Speech and Song).
Each dataset
offers a
unique collection of
speech recordings, providing a comprehensive representation of various emotional states [21], refer to table 2.
TESS (Toronto Emotional Speech Set): TESS comprises recordings from two female speakers, one young and one old, resulting in a
total of 2,800 audio files. The dataset includes random words spoken in seven different emotions, allowing for the study of emotional
variations across age groups and genders.
SAVEE (Surrey Audio-Visual Expressed Emotion): SAVEE consists of recordings from four male speakers, resulting in a total of 480
audio files. The dataset features the same set of sentences spoken in seven different emotions, providing insights into the acoustic
expressions of emotions by male speakers.
RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song): RAVDESS contains recordings from 24 professional actors (12 female, 12 male), covering a range of emotions expressed in both speech and song, and is widely used as a benchmark in speech emotion recognition research.
4.3 SOFTWARE AND HARDWARE REQUIREMENT
4.3.1 Software Requirements
Librosa: Librosa, a Python library for audio and music analysis, proves particularly useful when working with audio data, such as in music generation utilizing LSTM models or in Automatic Speech Recognition tasks. Librosa offers a comprehensive set of tools and functionalities that serve as the fundamental components for extracting and manipulating music-related information, refer to table 4.
TensorFlow and Keras: TensorFlow and Keras, two popular deep learning frameworks, provided the necessary tools and libraries for
constructing, training, and evaluating our deep learning models. These frameworks offer a high-level interface and efficient
computational capabilities for implementing complex neural network architectures.
Scikit-learn: Scikit-learn, a comprehensive machine learning library in Python, contributed to the implementation of various
preprocessing techniques, feature extraction methods, and classification algorithms. Its extensive collection of functions and
algorithms allowed us to streamline the implementation process and ensure efficient data processing.
Jupyter Notebook: Jupyter Notebook, an interactive computing environment, facilitated the development and experimentation of our
code. Its notebook-style interface enabled us to iteratively develop, visualize, and document our implementation, providing a seamless
workflow for our project.
Library Use
pandas Data manipulation and analysis
numpy Mathematical operations and array manipulation
seaborn Data visualization
matplotlib Plotting and visualization
librosa Audio signal processing
os Operating system-related functionalities
ipython.display Audio playback within Jupyter Notebook
Table 3: Packages used in Speech Emotion Recognition System
Package Use
sklearn.preprocessing One-hot encoding of emotion labels
keras.models Defining the model architecture
keras.layers Defining different layers in the model
tensorflow Deep learning model training and optimization
sklearn.metrics Calculation of accuracy
tensorflow.math Computation of the confusion matrix
tensorflow.python.ops.numpy_ops Enabling NumPy-like behavior in TensorFlow
Table 4: Libraries used in Speech Emotion Recognition System
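As an illustrative sketch of the label preparation step listed in Table 4, emotion labels can be one-hot encoded with scikit-learn; the label strings below are placeholders (in scikit-learn versions earlier than 1.2 the argument is named sparse rather than sparse_output):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical string labels parsed from the dataset file names.
labels = np.array(["happy", "sad", "angry", "happy", "neutral"]).reshape(-1, 1)

# sparse_output=False returns a dense array suitable for Keras training.
encoder = OneHotEncoder(sparse_output=False)
y_onehot = encoder.fit_transform(labels)

print(encoder.categories_[0])  # emotion classes in the order used for the one-hot columns
print(y_onehot.shape)          # (number of samples, number of classes)
```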
4.3.2 Hardware Requirements The implementation of our speech emotion recognition system required a suitable hardware setup to
handle the computational demands of deep learning and signal processing tasks. The key hardware requirements for our project
include:
CPU and Memory: A computer system equipped with a multi-core CPU and sufficient memory capacity was essential for efficiently
processing the large volumes of speech data and conducting intensive computations during model training and evaluation. We
recommend an Intel i5 2.5 GHz (or AMD equivalent) processor capable of boosting up to 3.5 GHz. This ensures efficient processing of
the large volumes of speech data and supports the computational requirements of the implemented models.
GPU (Graphics Processing Unit): To expedite the training process of deep learning models, we utilized a GPU, which offers parallel
processing capabilities and accelerates the computation of neural network operations. The GPU significantly reduces training times
and enhances the overall performance of the system.
GPU (preferred): For enhanced performance during deep learning model training, we recommend a dedicated GPU from NVIDIA or
AMD with a minimum of 4GB VRAM. The GPU's parallel processing capabilities significantly accelerate the computation of neural
network operations.
Memory: A minimum of 8GB RAM is recommended to handle the memory-intensive tasks involved in training and evaluating deep
learning models. Sufficient memory ensures smooth execution and reduces the likelihood of memory-related bottlenecks.
Secondary Storage: We recommend a minimum of 128GB SSD or HDD for efficient data storage and retrieval. This allows for the
storage of large datasets, trained models, and intermediate results generated during the implementation process.
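As a small illustrative check (not part of the required setup), TensorFlow can report whether a GPU is visible before training begins:

```python
import tensorflow as tf

# List GPUs visible to TensorFlow; an empty list means training will fall back to the CPU.
gpus = tf.config.list_physical_devices("GPU")
print("GPUs available:", gpus)
```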
4.4 PRE-PROCESSING AND FEATURE EXTRACTION In order to effectively capture and represent the relevant information contained in
speech signals, preprocessing and feature extraction techniques are employed. These steps aim to enhance the quality of the input
data and extract discriminative features that capture the emotional content of the speech. In this section, we discuss the
preprocessing steps and feature extraction techniques used in our implementation.
4.4.1 Preprocessing Steps
Noise Removal: Prior to feature extraction, it is essential to mitigate the effects of background noise that may interfere with the speech
signal. We applied noise removal techniques, such as spectral subtraction or wavelet denoising, to reduce unwanted noise and
improve the overall signal quality.
Framing: Speech signals are segmented into short frames to capture the temporal variations in speech. Each frame typically spans
around 20-30 milliseconds and is chosen to maintain a balance between capturing temporal information and ensuring sufficient
speech content within each frame [23], refer to Figure 8.
Figure 8: The Hierarchy of a speech emotion recognition system
Windowing: To minimize spectral leakage during the Fourier
transform, we applied a windowing function, such as the Hamming or Hanning window, to each frame. This helps to emphasize the
central portion of the frame while reducing the influence of the frame's edges [24].
Pre-emphasis: Pre-emphasis is employed to equalize the frequency spectrum of the speech signal and enhance high-frequency
components. It involves applying a high-pass filter to amplify higher frequencies, compensating for the attenuation of high-frequency
components during recording or transmission.
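A minimal NumPy sketch of these preprocessing steps is given below; the 25 ms frame length, 10 ms hop, Hamming window, and pre-emphasis coefficient of 0.97 are typical values assumed for illustration rather than the exact settings used in this work:

```python
import numpy as np

def preemphasize(signal, alpha=0.97):
    """Pre-emphasis filter: y[n] = x[n] - alpha * x[n-1], boosting high frequencies."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_and_window(signal, sr, frame_ms=25, hop_ms=10):
    """Split a signal into overlapping frames and apply a Hamming window to each."""
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    return frames  # shape: (n_frames, frame_len)

# Example with a synthetic 1-second signal at 16 kHz:
sr = 16000
x = np.random.randn(sr).astype("float32")
frames = frame_and_window(preemphasize(x), sr)
print(frames.shape)
```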
4.4.2 Feature Extraction Techniques
Mel-Frequency Cepstral Coefficients (MFCC): MFCCs (Mel-frequency cepstral coefficients) are extensively utilized in the field of
speech analysis and recognition. These coefficients effectively capture the spectral envelope of the speech signal, offering a concise
representation of its features.
Table 5: List of features present in an audio signal [25]
Feature Name Description
Zero Crossing Rate “The rate at which the signal changes its sign.”
Energy “The sum of the signal values squared and normalized using frame length.”
Entropy of Energy “The value of the change in energy.”
Spectral Centroid “The value at the center of the spectrum.”
Spectral Spread “The value of the bandwidth in the spectrum.”
Spectral Entropy “The value of the change in the spectral energy.”
Spectral Flux “The square of the difference between the spectral energies of consecutive frames.”
Spectral Rolloff “The value of the frequency under which 90% of the spectral distribution occurs.”
MFCCs “Mel Frequency Cepstral Coefficient values of the frequency bands distributed in the Mel-scale.”
Chroma Vector “The 12 values representing the energy belonging to each pitch class.”
Chroma Deviation “The value of the standard deviation of the Chroma vectors.”
To derive MFCCs, the log-scaled Mel-filterbank
energies are computed from each frame, and then the discrete cosine transform (DCT) is applied to these energies. This process
enables the extraction of compact and discriminative speech features for further analysis and recognition purposes [24].
Spectral Features: Spectral features, such as spectral centroid, spectral bandwidth, and spectral rolloff, provide information about the
distribution of energy across the frequency spectrum. These features can capture variations in pitch, timbre, and other spectral
characteristics associated with different emotions [25].
Pitch and Energy: Pitch represents the fundamental frequency of the speech
signal and can be a
valuable cue for emotion recognition. Energy measures the overall intensity of the speech signal and can provide insights into the
emotional intensity or arousal level. We extracted pitch and energy features using techniques like autocorrelation or cepstral analysis.
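As an illustrative sketch, several of the features listed in Table 5 (MFCCs, zero-crossing rate, spectral centroid, and chroma) can be computed with Librosa and averaged over time into a fixed-length vector; the file path and the mean-pooling strategy are assumptions for demonstration rather than the exact pipeline used in this work:

```python
import numpy as np
import librosa

def extract_features(path, n_mfcc=40):
    """Compute a fixed-length feature vector from an audio file (illustrative sketch)."""
    y, sr = librosa.load(path, sr=None)                       # keep the native sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # spectral envelope
    zcr = librosa.feature.zero_crossing_rate(y)               # sign-change rate
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # "center of mass" of spectrum
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # energy per pitch class
    # Average each feature over time to obtain one vector per utterance.
    return np.concatenate([
        mfcc.mean(axis=1), zcr.mean(axis=1), centroid.mean(axis=1), chroma.mean(axis=1)
    ])

# Example usage (path is hypothetical):
# features = extract_features("dataset/OAF_back_happy.wav")
```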
4.5 SPEECH EMOTION RECOGNITION TECHNIQUES Speech emotion recognition (SER) techniques aim to automatically detect and
classify emotions expressed in human speech. Here are some commonly used techniques for speech emotion recognition [26].
Acoustic Feature-Based Approaches: These techniques extract various acoustic features from speech signals, such as pitch, energy,
formants, MFCCs (Mel-frequency cepstral coefficients), and prosodic features. Machine learning algorithms, such as support vector
machines (SVM), hidden Markov models (HMM), or deep learning models, are then trained on these features to classify different
emotions.
Lexical and Prosodic Feature-Based Approaches: These techniques focus on extracting features related to the content and prosody of
speech. Lexical features involve analyzing the words and phrases used in speech, while prosodic features capture information about
the rhythm, intonation, and stress patterns. Machine learning models or rule-based systems are employed to classify emotions based
on these features.
Multimodal Approaches: In multimodal SER, multiple sources of information, such as speech, facial expressions, and body gestures,
are combined to recognize emotions. This approach leverages both audio and visual cues to improve the accuracy of emotion
recognition. For instance, audio features can be fused with facial expression features extracted from video recordings to achieve
better results.
Deep Learning Approaches: Deep learning models, particularly recurrent neural networks (RNNs) and convolutional neural networks
(CNNs), have shown promising results in SER. These models can learn hierarchical representations of speech data and capture
temporal dependencies effectively. They can be trained on raw audio signals or on extracted features to classify emotions.
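The sketch below is a minimal Keras example of such an architecture, combining a 1-D convolution with an LSTM over sequences of MFCC frames; the layer sizes and input shape are illustrative assumptions, not the configuration evaluated in this thesis.

import tensorflow as tf

def build_cnn_lstm(time_steps=200, n_mfcc=40, n_emotions=7):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(time_steps, n_mfcc)),
        # 1-D convolution over time captures local spectral/temporal patterns
        tf.keras.layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling1D(pool_size=2),
        # The LSTM models longer-range temporal dependencies across frames
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(n_emotions, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn_lstm()
model.summary()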
Ensemble Methods: Ensemble methods combine multiple individual models to make predictions. In SER, different models or classifiers
with diverse features or training strategies can be combined to improve the overall emotion recognition performance. Techniques like
voting, stacking, or bagging can be employed to form an ensemble of models.
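As an illustration of soft voting, the following scikit-learn sketch combines three placeholder classifiers; the estimators and the data are assumptions for demonstration only, not the ensemble used in this work.

import numpy as np
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 40))      # placeholder acoustic feature vectors
y = rng.integers(0, 7, size=300)    # placeholder emotion labels

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(probability=True)),              # probability estimates enable soft voting
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",   # average the predicted class probabilities across models
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))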
CHAPTER 5 RESULT AND DISCUSSION
5.1 INTRODUCTION In this section, we present the results and findings of our speech emotion recognition system based on deep
learning techniques. The evaluation of our system provides insights into its performance and effectiveness in accurately classifying
emotions from speech signals. We discuss the evaluation metrics used to measure the system's performance, the datasets employed
for testing, and the experimental setup. Furthermore, we provide an overview of the results obtained and lay the foundation for the
subsequent sections where we delve into a comparative analysis and discussion of our findings.
The primary objective of this evaluation is to assess the capability of our system to accurately recognize and classify emotions from
speech signals. We analyze the effectiveness of our chosen deep learning models, specifically the CNN and LSTM networks, in
capturing and leveraging the intricate patterns and temporal dependencies present in the speech data. Additionally, we evaluate the
performance of our system in comparison to existing methods and assess its ability to generalize well across different datasets and
emotional contexts [16].
To achieve these objectives, we utilized several evaluation metrics, including accuracy, precision, recall, and F1-score, to quantify the
performance of our system. These metrics allow us to measure the system's ability to correctly classify each emotion category and
provide a comprehensive assessment of its overall performance. The datasets used for evaluation include TESS (
Toronto Emotional Speech Set), SAVEE (Surrey Audio-Visual Expressed Emotion), and RAVDESS (Ryerson Audio-Visual Database of
Emotional Speech and Song) [16].
These datasets
offer a diverse range of emotional contexts and provide a benchmark for evaluating the performance of our system across different
genders, age groups, and linguistic variations.
5.2 EVALUATION METRICS In this section, we present the evaluation metrics used to assess the performance of our speech emotion
recognition system based on deep learning techniques. We analyze the results obtained from our experiments and discuss the
accuracy, precision, recall, and F1-score achieved by our system [27]; refer to Table 6.
                          Predicted: Class = Yes    Predicted: Class = No
Actual: Class = Yes       True Positive             False Negative
Actual: Class = No        False Positive            True Negative
Table 6: Confusion Matrix
To measure the effectiveness of our system in accurately classifying emotions from speech signals, we utilized the following
evaluation metrics: Accuracy: Accuracy measures the overall correctness of the predictions and is calculated as the ratio of correctly
classified instances to the total number of instances.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
where:
TP = True Positives (positive instances correctly classified as positive)
TN = True Negatives (negative instances correctly classified as negative)
FP = False Positives (negative instances incorrectly classified as positive)
FN = False Negatives (positive instances incorrectly classified as negative)

Precision: Precision measures the proportion of correctly predicted positive instances (emotions) among all instances predicted as positive. It indicates the system's ability to avoid false positive predictions.
Precision = TP / (TP + FP)
Recall: Recall measures the proportion of correctly predicted positive instances (emotions) among the total actual positive instances. It
assesses the system's ability to capture all instances of a particular emotion.
Recall = TP / (TP + FN)
F1-score: The F1-score is the harmonic mean of precision and recall, providing a single measure that balances both. It is especially useful when the emotion classes are imbalanced.
F1-score = 2 * (Precision * Recall) / (Precision + Recall)
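For reference, these metrics, together with the confusion matrix of Table 6, can be computed with scikit-learn as in the sketch below; the emotion labels shown are hypothetical.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Hypothetical true and predicted emotion labels for six test utterances
y_true = ["happy", "sad", "angry", "happy", "neutral", "sad"]
y_pred = ["happy", "sad", "happy", "happy", "neutral", "angry"]

print("Accuracy :", accuracy_score(y_true, y_pred))
# Weighted averaging aggregates the per-class scores by class frequency
print("Precision:", precision_score(y_true, y_pred, average="weighted", zero_division=0))
print("Recall   :", recall_score(y_true, y_pred, average="weighted", zero_division=0))
print("F1-score :", f1_score(y_true, y_pred, average="weighted", zero_division=0))
print("Confusion matrix:")
print(confusion_matrix(y_true, y_pred))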
5.3 RESULT In this section, we present the results obtained from the implementation of the speech emotion recognition system using
MFCC features and a deep learning model.
The system was evaluated on a dataset consisting of speech audio samples labeled with different emotions. The performance of the model was analyzed using several evaluation metrics, including accuracy, precision, recall, the F1-score, and the confusion matrix.
Accuracy: The accuracy of the model's predictions on the test set was found to be
approximately 83.79%. This indicates that the model accurately classified the emotions in the speech audio samples for the majority of
cases.
Precision: The precision, which measures the proportion of correctly predicted positive samples out of all positive predictions, was
calculated to be 0.8379.
Recall: The recall, also known as sensitivity or true positive rate, measures the proportion of correctly predicted positive samples out of
all actual positive samples. It was found to be 0.8349.
F1-Score: The F1-score, calculated as 0.8256, offers a balanced evaluation of the model's performance by considering both precision
and recall. It serves as a comprehensive metric that takes into account the model's ability to make accurate positive predictions
(precision) and its ability to capture all relevant positive instances (recall).
The confusion matrix provides a detailed breakdown of the model's predictions for each emotion class. It reveals the number of true
positive, true negative, false positive, and false negative predictions for each emotion [26]. The specific values for accuracy, precision,
recall, and F1-score, along with the confusion matrix, can be found in the generated output of the code.
Overall, our system demonstrated strong performance across all evaluated datasets, achieving high accuracy and robust precision-
recall trade-offs. The results validate the effectiveness of our chosen deep learning models, CNN and LSTM, in capturing the
emotional content present in speech signals.
5.4 COMPARATIVE ANALYSIS OF DIFFERENT TECHNIQUES In this section, we provide a comparative analysis of different deep learning techniques commonly employed for speech emotion recognition. Each technique leverages unique architectural designs and learning mechanisms to extract relevant features and classify emotions from speech signals [28]; the results are summarized in Table 7.
Technique                              Accuracy    Precision   Recall      F1-Score
Long Short-Term Memory (LSTM)          0.837920    0.848572    0.829826    0.839074
Convolutional Neural Network (CNN)     0.822479    0.832162    0.813829    0.822944
Recurrent Neural Network (RNN)         0.815725    0.824395    0.811193    0.817746
Table 7: Evaluation Metrics Comparison for Speech Emotion Recognition Models
Figure 9: Bar chart comparing the accuracy, precision, recall, and F1-score of the LSTM, CNN, and RNN models (values as listed in Table 7).
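A chart like Figure 9 can be reproduced from Table 7 with a short matplotlib sketch such as the one below; the values are copied (rounded) from the table above.

import numpy as np
import matplotlib.pyplot as plt

metrics = ["Accuracy", "Precision", "Recall", "F1-Score"]
scores = {
    "LSTM": [0.8379, 0.8486, 0.8298, 0.8391],
    "CNN":  [0.8225, 0.8322, 0.8138, 0.8229],
    "RNN":  [0.8157, 0.8244, 0.8112, 0.8177],
}

x = np.arange(len(metrics))   # one group of bars per metric
width = 0.25                  # width of each bar within a group
for i, (model, values) in enumerate(scores.items()):
    plt.bar(x + i * width, values, width, label=model)

plt.xticks(x + width, metrics)
plt.ylim(0.78, 0.86)
plt.ylabel("Score")
plt.title("Evaluation Metric Comparison")
plt.legend()
plt.show()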
5.5 LIMITATIONS AND CHALLENGES Despite the successful implementation of the speech emotion recognition system using deep
learning techniques, there are certain limitations and challenges that need to be acknowledged [29].
Limited Training Data: One of the primary challenges in developing accurate speech emotion recognition models is the availability of
diverse and well-labeled training data. Obtaining large-scale, diverse, and annotated datasets for different languages, accents, and
emotional expressions can be challenging [29]. The limited availability of comprehensive training data may impact the model's ability
to generalize across different populations and contexts.
Overfitting and Generalization: Deep learning models, such as CNNs and LSTMs, have a high capacity to learn intricate patterns and details from the training data. However, this can sometimes lead to overfitting, where the model performs well on the training data but fails to generalize to unseen data. Regularization techniques, cross-validation, and careful model selection are essential to mitigate overfitting and ensure good generalization performance (see the sketch below).

Dependency on Feature Extraction: The performance of speech emotion recognition models heavily relies on the quality and relevance of the features extracted from the speech signals. While deep learning models can automatically learn features from raw data, the selection and engineering of appropriate features still play a crucial role.
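The sketch below illustrates two common remedies for overfitting, dropout inside the network and early stopping on a validation split, using a small Keras model; the architecture and the placeholder data are assumptions, not the model trained in this work.

import numpy as np
import tensorflow as tf

# Placeholder feature vectors and labels stand in for real training data
rng = np.random.default_rng(2)
X_train = rng.normal(size=(500, 40)).astype("float32")
y_train = rng.integers(0, 7, size=500)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(40,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.4),                      # randomly silences units during training
    tf.keras.layers.Dense(7, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Early stopping halts training when the validation loss stops improving
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

model.fit(X_train, y_train, validation_split=0.2,
          epochs=50, callbacks=[early_stop], verbose=0)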
Ambiguity and Subjectivity of Emotions: Emotions are complex and multifaceted, and their expression in speech can vary significantly
among individuals. The subjective nature of emotions makes it challenging to define objective ground truth annotations for training
and evaluation. Discrepancies in annotator agreement and individual perception of emotions introduce inherent ambiguity in the
labeled datasets, affecting the model's performance [30].
Real-Time Processing Constraints: Real-time speech emotion recognition applications often face constraints on computational
resources and latency. Deep learning models, especially larger and more complex architectures, can be computationally intensive and
may not be suitable for real-time deployment on resource-constrained devices. Developing lightweight models or exploring hardware
acceleration techniques can help address these constraints.
CHAPTER 6 CONCLUSION AND FUTURE WORK
6.1 CONCLUSION In this study, we investigated the application of deep learning techniques, including Convolutional Neural Networks
(CNNs), Long Short-Term Memory networks (LSTMs), and Recurrent Neural Networks (RNNs), for speech emotion recognition. Our
proposed approach leveraged the power of these deep learning architectures to extract meaningful features from raw speech signals
and achieve accurate emotion classification.
Through comprehensive experimentation and evaluation, we obtained promising results. Our model achieved an accuracy of 83.79% in recognizing emotions from speech data, which highlights the effectiveness of deep learning techniques in capturing relevant acoustic features and modeling the intricate dynamics of emotions in speech. Specifically, the combination of CNNs and
LSTMs allowed us to capture both local and global patterns in the speech signals [17,29]. The CNNs were effective in extracting local
spectral and temporal features, while the LSTMs effectively modeled the sequential dependencies within the speech data. Additionally,
the inclusion of RNNs provided further context and enhanced our model's ability to capture long-term dependencies and subtle
emotional cues in the speech signals.
The proposed system demonstrated its potential for real-world applications in various domains such as human-computer interaction,
virtual assistants, and affective computing. The accurate recognition of emotions from speech can significantly enhance the user
experience and enable more natural and personalized interactions [31].
However, it is important to acknowledge some limitations and areas for future improvement. One limitation is the reliance on pre-
defined emotion categories, which may not fully capture the complexity and variability of human emotions. Further research could
explore the use of more fine-grained emotional labels or continuous emotion dimensions for a more nuanced representation.
6.2 SCOPE FOR FUTURE WORK
• Explore new acoustic, prosodic, and linguistic features to capture emotional cues in speech.
• Investigate deep learning architectures, such as CNNs, RNNs, and transformers, to improve emotion recognition from raw speech signals.
• Combine speech with other modalities, such as facial expressions or physiological signals, to enhance the accuracy of multimodal emotion recognition systems.
• Develop transfer learning and domain adaptation techniques to improve performance on smaller or domain-specific datasets.
• Incorporate contextual information to enhance the accuracy and understanding of recognized emotions.
• Investigate unsupervised or weakly supervised learning methods for emotion recognition with limited labeled data.
• Develop efficient algorithms for real-time and low-resource emotion recognition.
• Enhance the interpretability of emotion recognition models through attention mechanisms and visualization techniques.
• Extend emotion recognition to naturalistic settings, considering challenges such as background noise and overlapping speech.
• Integrate speech emotion recognition into personalized applications, such as virtual assistants, considering individual user characteristics.
REFERENCES
[1] Mustaqeem, “A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition,” 2019. https://www.semanticscholar.org/paper/A-CNN-Assisted-Enhanced-Audio-Signal-Processing-for-Mustaqeem-Kwon/406c689128de141d49b11f6b2c35c7a51e8fd730
[2] “Speech emotion recognition using convolutional long short-term memory neural network and support vector machines,” IEEE Conference Publication | IEEE Xplore.
[3] “Towards Emotion Recognition from Speech: Definition, Problems and the Materials of Research,” in Springer eBooks, 2010, pp. 127–143, doi: 10.1007/978-3-642-11684-1_8
[4] Y. Kim and J. Lee, “Emotion recognition based on deep learning with LSTM-RNN,” in Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 831–834.
[5] D. Li, J.-L. Liu, Z. Yang, L.-Y. Sun, and Z. Wang, “Speech emotion recognition using recurrent neural networks with directional self-attention,” Expert Systems with Applications, vol. 173, p. 114683, Jul. 2021, doi: 10.1016/j.eswa.2021.114683
[6] A. A. Viji, J. Jasper, and T. Latha, “Efficient Emotion Based Automatic Speech Recognition Using Optimal Deep Learning Approach,” Optik, p. 170375, Dec. 2022, doi: 10.1016/j.ijleo.2022.170375
[7] A. Koduru et al., “Feature extraction algorithms to improve the speech emotion recognition rate,” International Journal of Speech Technology, vol. 23, no. 1, pp. 45–55, Jan. 2020, doi: 10.1007/s10772-020-09672-4
[8] K. Liu et al., “GM-TCNet: Gated Multi-scale Temporal Convolutional Network using Emotion Causality for Speech Emotion Recognition,” Speech Communication, vol. 145, pp. 21–35, Nov. 2022, doi: 10.1016/j.specom.2022.07.005
[9] Y. Chen, R. Chang, and J. Guo, “Emotion Recognition of EEG Signals Based on the Ensemble Learning Method: AdaBoost,” Mathematical Problems in Engineering, vol. 2021, pp. 1–12, Jan. 2021, doi: 10.1155/2021/8896062
[10] P. Mohammadrezaei, M. Aminan, M. Soltanian, and K. Borna, “Improving CNN-based solutions for emotion recognition using evolutionary algorithms,” Results in Applied Mathematics, vol. 18, p. 100360, May 2023, doi: 10.1016/j.rinam.2023.100360
[11] C. Huang, W. Gong, W. Fu, and D. Feng, “A Research of Speech Emotion Recognition Based on Deep Belief Network and SVM,” Mathematical Problems in Engineering, vol. 2014, pp. 1–7, Jan. 2014, doi: 10.1155/2014/749604
[12] X. Ke, Y. Zhu, L. Wen, and W.-Z. Zhang, “Speech Emotion Recognition Based on SVM and ANN,” International Journal of Machine Learning and Computing, vol. 8, no. 3, pp. 198–202, Jun. 2018, doi: 10.18178/ijmlc.2018.8.3.687
[13] M. S. Likitha, S. C. Gupta, K. Hasitha, and A. U. Raju, “Speech based human emotion recognition using MFCC,” 2017, doi: 10.1109/wispnet.2017.8300161
[14] M. Suzuki and J. Qi, “Improvement of multilingual emotion recognition method based on normalized acoustic features using CRNN,” Procedia Computer Science, vol. 207, pp. 684–691, Jan. 2022, doi: 10.1016/j.procs.2022.09.123
[15] A. Aslam, A. B. Sargano, and Z. Habib, “Attention-based multimodal sentiment analysis and emotion recognition using deep neural networks,” Applied Soft Computing, vol. 144, p. 110494, Sep. 2023, doi: 10.1016/j.asoc.2023.110494
[16] Md. R. Ahmed, S. Islam, A. Islam, and S. Shatabda, “An ensemble 1D-CNN-LSTM-GRU model with data augmentation for speech emotion recognition,” Expert Systems with Applications, vol. 218, p. 119633, May 2023, doi: 10.1016/j.eswa.2023.119633
[17] K. Manohar and E. Logashanmugam, “Hybrid deep learning with optimal feature selection for speech emotion recognition using improved meta-heuristic algorithm,” Knowledge-Based Systems, vol. 246, p. 108659, Jun. 2022, doi: 10.1016/j.knosys.2022.108659
[18] R. A. Khalil, E. G. Jones, M. I. Babar, T. Jan, M. A. Zafar, and T. Alhussain, “Speech Emotion Recognition Using Deep Learning Techniques: A Review,” IEEE Access, vol. 7, pp. 117327–117345, Jan. 2019, doi: 10.1109/access.2019.2936124
[19] K. Wang, N. An, B. Li, Y. Zhang, and L. Li, “Speech Emotion Recognition Using Fourier Parameters,” IEEE Transactions on Affective Computing, vol. 6, no. 1, pp. 69–75, Jan. 2015, doi: 10.1109/taffc.2015.2392101
[20] A. V. Tsaregorodtsev et al., “The architecture of the emotion recognition program by speech segments,” Procedia Computer Science, vol. 213, pp. 338–345, Jan. 2022, doi: 10.1016/j.procs.2022.11.076
[21] C. Hema and F. P. G. Marquez, “Emotional speech Recognition using CNN and Deep learning techniques,” Applied Acoustics, vol. 211, p. 109492, Aug. 2023, doi: 10.1016/j.apacoust.2023.109492
[22] Y. Kim and J. Lee, “Emotion recognition based on deep learning with LSTM-RNN,” in Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 831–834.
[23] “Evaluation of the Effect of Frame Size on Speech Emotion Recognition,” IEEE Conference Publication | IEEE Xplore.
[24] “Windowing for Speech Emotion Recognition,” IEEE Conference Publication | IEEE Xplore.
[25] Saha, “Modulation spectral features for speech emotion recognition using deep neural networks,” Speech Communication, vol. 146, pp. 53–69, Jan. 2023, doi: 10.1016/j.specom.2022.11.005
[26] A. Christy, S. Vaithyasubramanian, A. Jesudoss, and M. D. A. Praveena, “Multimodal speech emotion recognition and classification using convolutional neural network techniques,” International Journal of Speech Technology, vol. 23, no. 2, pp. 381–388, Jun. 2020, doi: 10.1007/s10772-020-09713-y
[27] “An Efficient Speech Emotion Recognition Using Ensemble Method of Supervised Classifiers,” IEEE Conference Publication | IEEE Xplore.
[28] “Comparative Analysis of Speech Emotion Recognition Models and Technique,” IEEE Conference Publication | IEEE Xplore, Apr. 20, 2023. https://ieeexplore.ieee.org/document/10141044
[29] Y. B. Singh and S. Goel, “A systematic literature review of speech emotion recognition approaches,” Neurocomputing, vol. 492, pp. 245–263, Jul. 2022, doi: 10.1016/j.neucom.2022.04.028
[30] P. Nandwani and R. Verma, “A review on sentiment analysis and emotion detection from text,” Social Network Analysis and Mining, vol. 11, no. 1, Aug. 2021, doi: 10.1007/s13278-021-00776-6
[31] S. Ramakrishnan and I. M. M. E. Emary, “Speech emotion recognition approaches in human computer interaction,” Telecommunication Systems, vol. 52, no. 3, pp. 1467–1478, Sep. 2011, doi: 10.1007/s11235-011-9624-z