
2021 3rd International Conference on Advances in Computing, Communication Control and Networking (ICACCCN)

CNN based approach for Speech Emotion Recognition Using MFCC, Croma and STFT Hand-crafted features

Nagendra Kumar1, Ratndeep Kaushal1, Shubhi Agarwal1, and Youddha Beer Singh2*
1CSE, Galgotias College of Engineering and Technology, Gautam Buddha Nagar, UP, India
2CSIT, KIET Group of Institutions, Ghaziabad, UP, India
[email protected]*

Abstract— At present, speech emotion recognition (SER) is a very challenging and demanding research area because of its wide range of real-life applications. SER poses two major challenges: the first is to identify a relevant feature vector, and the second is to identify a suitable classifier. To address these challenges, we propose a model for SER. In this approach we first extract the hand-crafted features Mel frequency cepstral coefficients (MFCC), Chroma and Short-term Fourier Transform (STFT) from the emotional speech signals, and these extracted features are used as input to the chosen deep learning classifier, a Convolutional Neural Network (CNN). This work investigates the effectiveness of convolutional neural networks using the hand-crafted features MFCC with 13 coefficients, Chroma with 13 coefficients, and STFT. The proposed model comprises 8 CNN layers; the output of the last layer is fed to a flatten layer, and a soft-max activation function is then applied to identify the emotions. We employed the publicly accessible emotional speech databases RAVDESS, TESS, and SAVEE, as well as their combinations, to evaluate the performance of our proposed model. Experiments show that, in terms of average accuracy, the proposed model outperforms current state-of-the-art SER techniques.

Keywords— Speech Emotion Recognition, MFCC, Chroma, STFT, Convolution Neural Network.

I. INTRODUCTION

Speech is one of the most prominent modes of communication between humans and can be a feasible way to interact with computers as well. Human-computer interaction can become more personalized and interactive if computers begin to predict the emotional state of the interacting speaker. Speech is one of the natural ways for humans to express their emotions. Additionally, speech is easy to obtain and hence to process in real-time scenarios. This is why, to make machines more humanoid, identifying emotions from human speech becomes important. The use of speech signals to recognize emotions is an important as well as challenging field of Human-Computer Interaction. SER has many applications, such as assisting human-machine interaction, healthcare and security.

In this paper we use different combinations of three datasets (TESS, SAVEE, RAVDESS) and three hand-crafted features: Mel frequency cepstral coefficients (MFCC), Chroma and Short-term Fourier Transform (STFT). With the help of these feature extractors, we extract the features from the speech signals, and the extracted features are used as input to the deep learning-based classifier, a Convolutional Neural Network (CNN). The contributions of this paper are given below:
● We propose a CNN-based SER model using MFCC, Chroma, and STFT hand-crafted features.
● The proposed SER model is evaluated on the RAVDESS database and on the combined RAVDESS + TESS and RAVDESS + TESS + SAVEE databases.
The rest of the paper is organized as follows: related work is discussed in Section 2, the proposed model is described in Section 3, detailed experimental results are given in Section 4, and the conclusion is presented in Section 5.

II. RELATED WORK

Studies mention the advantage of hand-crafted features for audio data. However, several of them focus on single-mode emotion recognition, limiting the development of the models as a whole. Multimodal models utilise a CNN or RNN as a trainable feature extractor and do not consider both the temporal and global information of the speech at the same time, often overlooking the temporal properties.

In one study, two convolutional long short-term memory networks (CNN LSTM), one a 1-dimensional CNN LSTM network and the other a 2-dimensional CNN LSTM network, were employed. Both networks have the same architecture, consisting of one long short-term memory (LSTM) layer and 4 local feature learning blocks (LFBs), and recognize speech emotions such as happiness, surprise, disgust, neutral, sadness, fear and anger. The model showed a wide range of accuracy figures that vary over the databases and methods used [1]. Another approach, based on phoneme sequences and spectrograms, has an advantage over converting speech into text because it retains the emotion content that text otherwise loses. A combined phoneme- and spectrogram-based convolutional model was tested and found to be the most effective in recognizing human emotions on the IEMOCAP data set [2].


Some research is based on evaluating the stability of neural acoustic emotion recognition models: after conducting numerous trials on the iCub robot platform, intended to narrow the gap between the performance of the model during training and during testing in a real-world setting, the model was able to predict the emotions anger, happiness, neutral and sadness with an accuracy of around 83.2% [3].

With the increasing popularity of auto-encoders in machine learning, researchers have investigated adversarial auto-encoders for emotion recognition on the basis of two aspects: one is their capacity to encode high-dimensional feature vectors into a compressed set, and the other is their potential to regenerate synthetic samples of the original dataset, which may then be utilised for purposes such as training emotion detection classifiers [4]. Existing SER work has focused on automatic emotion detection with training and testing sets drawn from the same corpus, so models were not trained for the mismatched conditions that cause a significant performance drop in cross-corpus and cross-language scenarios. To address this issue, five different approaches to cross-corpus emotion recognition were evaluated against a Sparse Autoencoder and SVM baseline system [5]. Another approach merged two CNNs, one with 1D CNN layers and the other with 2D CNN layers, and applied transfer learning to expedite the training of the merged CNN: the 1D CNN and 2D CNN were first trained, the learned features were then fed to the merged CNN, and the merged DCNN was finally trained. The results show that the merged DCNN improves speech emotion recognition performance [7]. It has also been observed that a study and comparison of different feature extraction methods produced strong results and paved the way for real-time emotion detection over the duration of an utterance [6]. Using the Chinese Academy of Sciences emotional speech database, the relationship between emotion recognition performance and feature fusion was examined with two methods, namely deep belief networks and support vector machines, abbreviated as DBN and SVM respectively. Using an SVM multi-classification algorithm with optimization of the penalty factor and kernel function parameters led to an accuracy of 84.54%, while DBNs led to a mean accuracy of 94.6% on both gender-independent and gender-dependent experiments [8]. A more effective method was created in which SVMs and DBNs were used together and combined through novel classification methods instead of being used individually. Experiments were gender-dependent, and five types of features were extracted, including MFCC, short-term zero-crossing rate and energy, pitch and formants. This approach achieved an accuracy of 95.8% [9]. Speech emotion recognition has also been addressed with a frame-based framework that works end-to-end with deep learning and minimal speech processing to model intra-utterance dynamics. The model was based on different variants of recurrent neural network and feed-forward architectures, and the experiments highlighted the pros and cons of the prepared models for emotion recognition and paralinguistic speech recognition [10].

Unlike traditional SER approaches, attention has also shifted to emotion-discriminative information and to the feature distribution differences between training and testing datasets. An Emotion-discriminative and Domain-invariant Feature Learning Method, in short EDFLM, was proposed based on Domain Adaptation (DA) methods, considering both domain-invariant and emotion-discriminative features by employing a domain label constraint and an emotion label constraint. The model separated emotion-related and emotion-unrelated features and used a back-propagation network on acoustic features [11]. The problem of poor generalization of emotion classifiers was also addressed: a novel unsupervised domain adaptation model, namely Universum autoencoders, was proposed, which evaluated and enhanced the system's performance when the conditions of the training and testing sets were not the same [12]. A multi-task DNN with shared hidden layers (MT-SHL-DNN) was proposed, in which the feature transformations are shared across the different emotion representations while the output layers are separately associated with each database; the model was designed to manage the scarcity of annotated acoustic data [13]. In an earlier study, a CNN model was proposed that took spectrograms generated from the speech signals as input; the model consisted of 3 convolutional layers and 3 fully connected layers that extracted discriminative features from the spectrogram images and returned predictions over 7 different emotions [14]. The summary of the literature review is given in Table 1.

Table 1: Summarised Literature Review for SER

Ref. | Database | Approach | Accuracy
[1] | EmoDB, IEMOCAP | 1D CNN LSTM, 2D CNN LSTM | 40.02% to 95.89%
[2] | IEMOCAP | 2D CNN | 4% higher compared to the existing
[3] | IEMOCAP | RNN + CNN | 83.2%
[4] | IEMOCAP | Adversarial Auto-encoders (AAE) | 57.88%
[5] | FAU-AIBO, EmoDB, IEMOCAP, EMOVO, SAVEE | DBM based on RBM | Comparatively higher accuracy than existing
[6] | IEMOCAP | TDNN-LSTM | 70.6%
[7] | IEMOCAP and EmoDB | DCNN | 92.71%
[8] | CAS emotional speech database | DBN and SVM | 94.6%, 84.54%
[9] | Chinese Academy of Sciences emotional speech database | SVM + DBN | 94.6%
[10] | IEMOCAP | RNN + CNN | 64.78%
[11] | INTERSPEECH 2009 Emotional Challenge, EmoDB and ABC databases | EDFLM (Emotion-Discriminative and Domain-Invariant Feature Learning Method) | 61.63%
[12] | EmoDB, ABC and Geneva Whispered Emotion Corpus | Unsupervised Universum Autoencoders Adaptive Model | 62%, 63.3% and 62.8%


III. PROPOSED WORK

This paper proposes a model for SER. In this approach we first extract the hand-crafted features Mel frequency cepstral coefficients (MFCC), Chroma and Short-term Fourier Transform (STFT) from the emotional speech signals, and these extracted features are used as input to the deep learning classifier, a Convolutional Neural Network (CNN). This work investigates the effectiveness of convolutional neural networks using the hand-crafted features MFCC with 13 coefficients, Chroma with 13 coefficients, and STFT. The proposed model consists of 8 CNN layers; the output of the last layer is fed to a flatten layer, and a soft-max activation function is then applied to identify the emotions. To test the performance of our proposed model, we examined it on the publicly available databases RAVDESS, TESS and SAVEE. The details of these methods are discussed in the following sections.

A. Datasets
In this work we used three datasets, RAVDESS, TESS and SAVEE, and their different combinations.

RAVDESS: The Ryerson Audio-Visual Database of Emotional Speech and Song contains seven emotion categories, i.e., sad, anger, calm, happy, fearful, surprise, and disgust expressions, plus a neutral expression; each expression is recorded at two levels of intensity, i.e., normal and strong. The speech utterances were recorded at a 48 kHz sampling rate with 16-bit resolution. The whole database comprises 1440 files from 24 professional actors (12 male, 12 female). The average duration of the audio files is 3 seconds.

TESS: The Toronto Emotional Speech Set is a very high-quality, female-only audio corpus comprising 200 target words spoken by two actresses (aged 26 and 64) in the carrier phrase "Say the word _". The data set incorporates seven emotions, i.e., pleasant surprise, disgust, sadness, fear, anger, happiness, and neutral, for a total of 2800 data points. The dataset is organised such that each of the two actresses and their emotions are contained within their own folder, and each of the 200 target-word audio files can be found within it. The audio file format is WAV.

SAVEE: The Surrey Audio-Visual Expressed Emotion database was recorded for the development of an automatic emotion recognition system; it consists of 480 British English utterances by 4 male actors covering 7 distinct emotions. The data were recorded, processed and labelled in a visual media lab with the assistance of high-quality audio-visual hardware. TIMIT-based, phonetically balanced sentences were chosen for every emotion. The recordings were evaluated by 10 subjects under audio, visual and audio-visual conditions to test the quality of the performances.
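The paper does not describe how the audio files are read or how emotion labels are assigned. As one concrete illustration, the short sketch below shows how labels could be recovered from RAVDESS file names, whose third hyphen-separated field encodes the emotion; this is a documented convention of the dataset, not something specified in this paper, and the directory layout and helper names are hypothetical.

```python
import os
import glob

# RAVDESS filename convention (documented by the dataset, assumed here):
# e.g. "03-01-06-01-02-01-12.wav" -> third field "06" is the emotion code.
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def ravdess_label(path: str) -> str:
    """Return the emotion label encoded in a RAVDESS file name."""
    code = os.path.basename(path).split("-")[2]
    return RAVDESS_EMOTIONS[code]

# Hypothetical directory layout; TESS and SAVEE encode the emotion directly
# in folder or file names and can be handled analogously.
files = glob.glob("data/ravdess/Actor_*/*.wav")
labels = [ravdess_label(f) for f in files]
```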
B. Feature Extraction
Three feature extraction methods are investigated in this paper: Mel frequency cepstral coefficients (MFCC), Chroma and the Short-term Fourier Transform (STFT), with MFCC using 13 coefficients and Chroma using 13 coefficients. These three feature extractors were chosen on the basis of past research. The time-domain speech files are windowed using a Hanning window with a 20 ms window size and 10 ms overlap, and a magnitude spectrum, i.e., a time-frequency representation of the speech, is generated as a spectrogram. Zero padding helps construct spectrograms of identical size, which is necessary for CNN input. The Mel spectrogram differs from the magnitude spectrogram in its filter bank, which replicates the human ear by concentrating more on lower-frequency regions than on upper frequencies. The following formula is used to convert a frequency in Hertz to the Mel scale:

Mel(f) = 2595 log10(1 + f / 700)    (1)

Chroma is often considered important for high-level semantic analysis; in high-level tasks, the chroma feature enables far better results. Short-time Fourier Transforms and the Constant-Q Transform are used to retrieve chroma features. The Short-Time Fourier Transform (STFT) can be viewed as a sequence of Fourier transforms: for signals that change over time, it gives time-localized frequency information. STFTs are computed by dividing an extended temporal signal into smaller segments of equal length and performing the Fourier transform on each segment in the same way.
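The paper does not name a feature-extraction toolkit. The sketch below shows one way the MFCC (13 coefficients), chroma and STFT features described above could be computed with librosa, using the 20 ms window and 10 ms hop stated in the text; averaging each feature over time to obtain a fixed-length vector is an assumption, not something the paper specifies.

```python
import numpy as np
import librosa

def hz_to_mel(f_hz):
    # Eq. (1): Mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def extract_features(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)      # keep the file's native sampling rate
    n_fft = int(0.020 * sr)                  # 20 ms analysis window
    hop = int(0.010 * sr)                    # 10 ms hop, i.e. 10 ms overlap between windows

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    # librosa's chroma_stft defaults to 12 pitch classes; the paper reports a
    # 13-dimensional chroma feature, so the exact configuration is unclear.
    chroma = librosa.feature.chroma_stft(y=y, sr=sr,
                                         n_fft=n_fft, hop_length=hop)
    stft_mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop,
                                   window="hann"))

    # Assumption: average each feature over time to get one vector per file.
    return np.concatenate([mfcc.mean(axis=1),
                           chroma.mean(axis=1),
                           stft_mag.mean(axis=1)])
```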
C. Convolutional neural network
The architecture of a CNN is made up of three parts. The first is the convolutional layers, which contain a number of filters applied to the input; within a convolutional layer, each filter scans the input, applying a scalar product and summation to produce a set of feature maps. The pooling layer is the second important component; there are several approaches for reducing dimensionality, including min pooling, max pooling, average pooling, mean pooling, and so on. The final component is the flatten and fully connected (FC) layers, which aggregate the extracted features and feed them to a SoftMax classifier to obtain the probability of each class. The proposed model comprises 8 CNN layers with varying dropout; the output of the last layer is fed to a flatten layer, and a soft-max activation function is then applied to identify the emotions. The network is used with spectrogram and Mel spectrogram inputs with initial dimensions of 3540 × 216. The kernel of the first CNN layer has dimensions 8 × 8 with 216 × 256 features. Max pooling with dimensions 8 × 8 is used to shrink the feature maps, reducing the number of attributes used for training. Fully connected flatten layers combined with a SoftMax layer are used to obtain the class probabilities for every spectrogram image. To ensure the higher layers are normalized after the convolution layers, we used batch normalization, which improves the stability and performance of deep networks and also speeds up training. Additionally, dropout (0.25) is used to mitigate overfitting.
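The paper specifies 8 convolutional layers, an 8 × 8 kernel in the first layer, 8 × 8 max pooling, batch normalization, dropout of 0.25 and a soft-max output, but not the complete layer-by-layer configuration. The Keras sketch below is therefore only one plausible realisation under those constraints; the filter counts, the grouping into blocks, the pooling sizes after the first block and the exact input handling are assumptions.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 8                      # e.g. the eight RAVDESS emotion classes
INPUT_SHAPE = (3540, 216, 1)         # spectrogram size stated in the paper; adjust as needed

def build_ser_cnn():
    model = models.Sequential()
    model.add(layers.Input(shape=INPUT_SHAPE))

    # 8 convolutional layers arranged as 4 blocks of 2 (filter counts assumed).
    for i, filters in enumerate([64, 128, 256, 256]):
        first_kernel = (8, 8) if i == 0 else (3, 3)   # 8x8 kernel in the first layer, per the paper
        model.add(layers.Conv2D(filters, first_kernel, padding="same", activation="relu"))
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.BatchNormalization())
        pool = (8, 8) if i == 0 else (2, 2)           # 8x8 pooling per the paper; later pools assumed
        model.add(layers.MaxPooling2D(pool))
        model.add(layers.Dropout(0.25))               # dropout value used in the paper

    model.add(layers.Flatten())
    model.add(layers.Dense(NUM_CLASSES, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Grouping the eight convolutional layers into blocks that each end with batch normalization, pooling and dropout mirrors the regularisation described above, but other groupings are equally consistent with the text; model.summary() can be used to confirm the stack before the flatten and soft-max stages.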
D. Architecture of the proposed model
In this model we use the datasets RAVDESS, TESS and SAVEE both individually and in different combinations, i.e., RAVDESS + TESS + SAVEE, RAVDESS + TESS, TESS + SAVEE and RAVDESS + SAVEE. One of the feature extractors among MFCC, STFT and Chroma is selected, the model is trained with a deep learning approach, namely the CNN, and the model accuracy is tested. The model is run on the different individual and combined databases against each feature extractor, and the configuration with the higher accuracy is chosen.


The trained model is then saved as an .h5 file, and the emotions in the speech samples are recognized. The architecture of the proposed SER model is given in Fig. 1.

Fig. 1: Proposed SER Model
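No training script is given in the paper, which reports only the varied settings (for example dropout 0.25 with a test size of 0.1 for the best RAVDESS + TESS result in Section IV) and the fact that the trained model is saved as an .h5 file. The minimal end-to-end sketch below reuses build_ser_cnn from the sketch above; the use of scikit-learn's train_test_split, the label encoding, the placeholder data, the number of epochs and the batch size are all assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# Placeholder data standing in for the real zero-padded spectrograms and
# emotion labels prepared as described above (shapes only, values random).
rng = np.random.default_rng(0)
X = rng.random((16, 3540, 216, 1), dtype=np.float32)
labels = rng.choice(["angry", "calm", "disgust", "fearful",
                     "happy", "neutral", "sad", "surprised"], size=16)
y = to_categorical(LabelEncoder().fit_transform(labels), num_classes=8)

# Hold out 10% of the data for testing, matching the best-performing
# setting reported in Table 2 (dropout 0.25, test size 0.1).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0)

model = build_ser_cnn()                      # CNN sketch from the previous block
model.fit(X_train, y_train, validation_data=(X_test, y_test),
          epochs=1, batch_size=2)            # epochs/batch size are illustrative only

_, acc = model.evaluate(X_test, y_test)
model.save("ser_cnn.h5")                     # saved as an .h5 file, as in the paper
print(f"Test accuracy: {acc:.4f}")
```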

IV. EXPERIMENTAL RESULTS

The model was run on different combinations of the data sets with various parameters, such as the dropout value, the feature extractor, the learning rate and the test size. Some of the configurations with significant results are shown in Table 2. After analysis, we found that the combination of RAVDESS + TESS with the MFCC feature extractor (13 coefficients), dropout 0.25 and test size 0.1 led to the best accuracy. The detailed experimental results are given in Table 2.

TABLE 2. Accuracy of different combinations

Dataset | Feature Extractor | Dropout | Test Size | Accuracy (%)
RAVDESS | MFCC = 13 | 0.25 | 0.25 | 38.71
RAVDESS, TESS | MFCC = 13 | 0.25 | 0.25 | 80.86
RAVDESS, TESS | MFCC = 13 | 0.50 | 0.25 | 71.59
RAVDESS, TESS | MFCC = 13 | 0.35 | 0.25 | 76.79
RAVDESS, TESS | MFCC = 13 | 0.40 | 0.25 | 74.20
RAVDESS, TESS | MFCC = 13 | 0.40 | 0.30 | 74.49
RAVDESS, TESS | MFCC = 13 | 0.25 | 0.10 | 82.25
RAVDESS, TESS | MFCC = 13 | 0.30 | 0.10 | 79.96
RAVDESS, TESS, SAVEE | MFCC = 13 | 0.25 | 0.25 | 77.23
RAVDESS | STFT = 13 | 0.25 | 0.25 | 49.88
RAVDESS, TESS | STFT = 13 | 0.25 | 0.25 | 62.55
RAVDESS, TESS, SAVEE | STFT = 13 | 0.25 | 0.25 | 70.93
RAVDESS | Chroma | 0.25 | 0.25 | 49.88
RAVDESS, TESS | Chroma | 0.25 | 0.25 | 62.55
RAVDESS, TESS, SAVEE | Chroma | 0.25 | 0.25 | 70.93

V. CONCLUSIONS

In this paper we have proposed a SER model. The proposed model is evaluated on the publicly available emotional speech databases RAVDESS, TESS and SAVEE and on their combinations RAVDESS + TESS and RAVDESS + TESS + SAVEE, using MFCC, Chroma, and STFT hand-crafted features. From the experimental results, it is found that the proposed model gives its best result of 82.25% for the RAVDESS + TESS combination of databases using MFCC, and for the RAVDESS + TESS + SAVEE combination the model gives an average accuracy of 77.23% using MFCC features, 70.93% using STFT features and 70.93% using Chroma features. From these results we can conclude that the proposed SER model gives better results using MFCC features in the case of combined databases, while in the case of individual databases, RAVDESS with STFT and Chroma features gives 11.17% better results than with MFCC features. In future research, we will continue to work along this line to improve the application of SER and also try merged combinations of classifiers.

REFERENCES

[1] J. Zhao, X. Mao, and L. Chen, "Speech emotion recognition using deep 1D & 2D CNN LSTM networks," Biomed. Signal Process. Control, vol. 47, pp. 312–323, Jan. 2019.
[2] P. Yenigalla, A. Kumar, S. Tripathi, C. Singh, S. Kar, and J. Vepa, "Speech emotion recognition using spectrogram & phoneme embedding," in Proc. Interspeech, 2018, pp. 3688–3692.
[3] E. Lakomkin, M. A. Zamani, C. Weber, S. Magg, and S. Wermter, "On the robustness of speech emotion recognition for human-robot interaction with deep neural networks," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 854–860.
[4] S. Sahu, R. Gupta, G. Sivaraman, W. AbdAlmageed, and C. Espy-Wilson, "Adversarial auto-encoders for speech based emotion recognition," 2018, arXiv:1806.02146. [Online]. Available: https://arxiv.org/abs/1806.02146
[5] S. Latif, R. Rana, S. Younis, J. Qadir, and J. Epps, "Transfer learning for improving speech emotion classification accuracy," 2018, arXiv:1801.06353. [Online]. Available: https://arxiv.org/abs/1801.06353
[6] M. Sarma, P. Ghahremani, D. Povey, N. K. Goel, K. K. Sarma, and N. Dehak, "Emotion identification from raw speech signals using DNNs," in Proc. Interspeech, 2018, pp. 3097–3101.
[7] J. Zhao, X. Mao, and L. Chen, "Learning deep features to recognise speech emotion using merged deep CNN," IET Signal Process., vol. 12, no. 6, pp. 713–721, 2018.
[8] W. Zhang, D. Zhao, Z. Chai, L. T. Yang, X. Liu, F. Gong, and S. Yang, "Deep learning and SVM-based emotion recognition from Chinese speech for smart affective services," Softw., Pract. Exper., vol. 47, no. 8, pp. 1127–1138, 2017.
[9] L. Zhu, L. Chen, D. Zhao, J. Zhou, and W. Zhang, "Emotion recognition from Chinese speech for smart affective services using a combination of SVM and DBN," Sensors, vol. 17, no. 7, p. 1694, 2017.


[10] H. M. Fayek, M. Lech, and L. Cavedon, "Evaluating deep learning architectures for speech emotion recognition," Neural Netw., vol. 92, pp. 60–68, Aug. 2017.
[11] Q. Mao, G. Xu, W. Xue, J. Gou, and Y. Zhan, "Learning emotion discriminative and domain-invariant features for domain adaptation in speech emotion recognition," Speech Commun., vol. 93, pp. 1–10, Oct. 2017.
[12] J. Deng, X. Xu, Z. Zhang, S. Frühholz, and B. Schuller, "Universum autoencoder-based domain adaptation for speech emotion recognition," IEEE Signal Process. Lett., vol. 24, no. 4, pp. 500–504, Apr. 2017.
[13] Y. Zhang, Y. Liu, F. Weninger, and B. Schuller, "Multi-task deep neural network with shared hidden layers: Breaking down the wall between emotion representations," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 4990–4994.
[14] A. M. Badshah, J. Ahmad, N. Rahim, and S. W. Baik, "Speech emotion recognition from spectrograms with deep convolutional neural network," in Proc. IEEE Int. Conf. Platform Technol. Service (PlatCon), Feb. 2017, pp. 1–5.
