CNN-based Approach for Speech Emotion Recognition Using MFCC, Chroma and STFT Hand-crafted Features
Abstract— At present, speech emotion recognition (SER) is a very challenging and demanding research area because of its wide real-life applications. SER poses two major challenges: the first is to identify the relevant feature vector and the second is to identify a suitable classifier. To address these challenges, we propose a model for SER. In this approach we first extract the hand-crafted features Mel-frequency cepstral coefficients (MFCC), Chroma and short-term Fourier transform (STFT) from the emotional speech signals, and these extracted features are given as input to the chosen deep learning classifier, a Convolutional Neural Network (CNN). This work investigates the credibility of convolutional neural networks using the hand-crafted features MFCC with value 13, Chroma with value 13 and STFT. The proposed model comprises 8 CNN layers; the output of the last layer is fed to a flatten layer, and a soft-max activation function is then applied to identify the emotions. We employed the publicly accessible speech emotion databases RAVDESS, TESS and SAVEE, as well as their combinations, to evaluate the performance of our proposed model. Experiments show that, in terms of average accuracy, the proposed model outperforms current state-of-the-art SER techniques.

Keywords— Speech Emotion Recognition, MFCC, Chroma, STFT, Convolutional Neural Network.

I. INTRODUCTION

Speech is one of the most prominent modes of communication between humans and can be a feasible way to interact with computers as well. Human-computer interaction can become more personalized and interactive if computers begin to predict the emotional state of the interacting speaker. Speech is one of the natural ways for humans to express their emotions. Additionally, speech is easy to obtain and hence to process in real-time scenarios. This is why, to make machines more humanoid, identifying emotions from human speech becomes important. The use of speech signals to recognize emotions is an important as well as challenging field of human-computer interaction. SER has many applications, such as assisting human-machine interaction, healthcare and security.

In this paper we use different combinations of three datasets (TESS, SAVEE, RAVDESS) and three hand-crafted features: Mel-frequency cepstral coefficients (MFCC), Chroma and short-term Fourier transform (STFT). With the help of these feature extractors, we extract the features from the speech signals, and the extracted features are given as input to the deep-learning-based classifier, a Convolutional Neural Network (CNN).
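For illustration only, the sketch below shows one way such features could be computed with the librosa library. Apart from the 13 MFCC coefficients stated in this paper, every choice in it (loading at the native sampling rate, the default Chroma and STFT sizes, mean pooling over time, and the helper name extract_features) is an assumption made for the example rather than a detail reported here.

import numpy as np
import librosa

def extract_features(path, n_mfcc=13):
    # Load one speech file at its native sampling rate (assumption).
    y, sr = librosa.load(path, sr=None)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (13, frames)
    stft = np.abs(librosa.stft(y))                          # magnitude spectrogram
    chroma = librosa.feature.chroma_stft(S=stft, sr=sr)     # shape (12, frames)

    # Pool each feature over time to obtain one fixed-length vector per file.
    return np.concatenate([mfcc.mean(axis=1),
                           chroma.mean(axis=1),
                           stft.mean(axis=1)])

The resulting fixed-length vectors can then be stacked into a feature matrix, one row per utterance, before being passed to the classifier.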
The contributions of this paper are given below:
● We propose a CNN-based SER model using MFCC, Chroma and STFT hand-crafted features.
● The proposed SER model is evaluated on the RAVDESS database and on the combined RAVDESS + TESS and RAVDESS + TESS + SAVEE databases.

The rest of the paper is organized as follows: related work is discussed in Section 2, the proposed model is described in Section 3, detailed experimental results are given in Section 4, and the conclusion of this paper is presented in Section 5.
II. RELATED WORK

Studies mention the advantage of hand-crafted features for audio data. However, several of them focus on single-mode emotion recognition, limiting the development of the models as a whole. Multimodal models utilise a CNN or RNN as a trainable feature extractor but do not consider both the temporal and the global information of the speech at the same time, and often overlook the temporal properties.

In one study, two convolutional neural networks combined with long short-term memory (CNN LSTM) were employed, one being a 1-dimensional CNN LSTM network and the other a 2-dimensional CNN LSTM network. Both networks have the same architecture, consisting of one long short-term memory (LSTM) layer and 4 local feature learning blocks (LFBs), and recognize speech emotions such as happiness, surprise, disgust, neutral, sadness, fear and anger. The model showed a wide range of accuracy measures that vary over the databases and methods used [1]. Another approach was based on phoneme sequences and spectrograms, which have an advantage over speech converted into text because they retain the emotion content that such a conversion misses. A combination of phoneme- and spectrogram-based convolutional models was tested and found to be the most effective in recognizing human emotions on the IEMOCAP data set [2]. Some research is based on evaluating the stability of neural acoustic emotion recognition models. After conducting numerous trials on the
iCub robot platform in order to narrow the gap between the performance of the model during training and its performance when tested in a real-world setting, the model was able to predict the anger, happiness, neutral and sadness emotions with an accuracy of around 83.2% [3].

With the increase in popularity of auto-encoders in machine learning, researchers investigated the application of adversarial auto-encoders to emotion recognition on the basis of two aspects: one is their capacity to encode high-dimensional feature vectors into a compressed set, the other is their potential to regenerate synthetic samples of the original dataset, which may then be utilised for purposes like training emotion detection classifiers [4]. Because existing SER work focused on automatic emotion detection with training and testing sets drawn from the same corpus, models were not trained for the discriminative environments that cause a significant performance drop in cross-corpus and cross-language scenarios. To overcome this issue, five different approaches to cross-corpus emotion recognition were evaluated relative to a Sparse Autoencoder and SVM baseline system [5]. Another approach merged a 1D CNN and a 2D CNN, with transfer learning later introduced to expedite the training of the merged CNN: first the 1D CNN and 2D CNN were trained, then the learned features were fed to the merged CNN, and finally the merged DCNN was set up. The results showed that the merged DCNN improved speech emotion recognition efficiency [7]. It has also been observed that studies comparing different feature extraction methods obtained the best results and paved the way for real-time emotion detection over the duration of an utterance [6]. Using the Chinese Academy of Sciences emotional speech database, the connection between emotion recognition performance and feature fusion was examined over two methods, namely deep belief networks and support vector machines, abbreviated as DBN and SVM respectively. Using the SVM multi-classification algorithm with optimization of the penalty factor parameter and kernel function led to an accuracy of 84.54%, while DBNs led to a mean accuracy of 94.6% on both gender-independent and gender-dependent experiments [8]. A more effective method combined SVM and DBNs through novel classification methods instead of using them individually; the experiments were gender-dependent and five types of features were extracted, including MFCC, short-term zero-crossing rate and energy, pitch and formants. This approach achieved an accuracy of 95.8% [9]. Speech emotion recognition was also addressed with a frame-based framework that relied on end-to-end deep learning and minimal speech processing to model intra-utterance dynamics. The model was based on different variants of recurrent neural network and feed-forward architectures, and the experiments highlighted the pros and cons of the prepared models in emotion recognition and paralinguistic speech recognition [10].

Unlike traditional SER approaches, attention has also shifted to emotion-discriminative information and to the feature distribution differences between training and testing datasets. An Emotion-discriminative and Domain-invariant Feature Learning Method, in short EDFLM, was proposed based on Domain Adaptation (DA) methods, considering both domain-invariant and emotion-discriminative features by employing a domain label constraint and an emotion label constraint. The model separated emotion-related and emotion-unrelated features and used a back-propagation network on acoustic features [11]. The problem of poor generalization of emotion classifiers was also addressed: a novel unsupervised domain adaptation model, namely Universum autoencoders, was proposed, which evaluated and enhanced the system's performance when the conditions of the training and testing sets were not the same [12]. A multi-tasking DNN with shared hidden layers (MT-SHL-DNN) was proposed, in which the feature transformations were shared across different emotion representations and the output layers were separately associated with every database; the model was designed to manage the scarcity of annotated acoustic data [13]. In a study prior to this, a CNN model was proposed that took spectrograms generated from the speech signals as input. The model consisted of 3 convolutional layers and 3 fully connected layers that extracted discriminative features from the spectrogram images and returned 7 different emotion predictions [14]. The summary of the literature review is given in Table 1.

Table 1: Summarised Literature Review for SER

Ref | Database | Approach | Accuracy
[1] | EmoDB, IEMOCAP | 1D CNN LSTM, 2D CNN LSTM | 40.02% to 95.89%
[2] | IEMOCAP | 2D CNN | 4% higher compared to the existing
[3] | IEMOCAP | RNN + CNN | 83.2%
[4] | IEMOCAP | Adversarial Auto-encoders (AAE) | 57.88%
[5] | FAU-AIBO, EmoDB, IEMOCAP, EMOVO, SAVEE | DBM based on RBM | Comparatively higher accuracy than existing
[6] | IEMOCAP | TDNN-LSTM | 70.6%
[7] | IEMOCAP and EmoDB | DCNN | 92.71%
[8] | CAS emotional speech database | DBN and SVM | 94.6%, 84.54%
[9] | Chinese Academy of Sciences emotional speech database | SVM + DBN | 94.6%
[10] | IEMOCAP | RNN + CNN | 64.78%
[11] | INTERSPEECH 2009 Emotion Challenge, EmoDB and ABC databases | EDFLM (Emotion-discriminative and Domain-invariant Feature Learning Method) | 61.63%
[12] | EmoDB, ABC and Geneva Whispered Emotion Corpus | Unsupervised Universum Autoencoders Adaptive Model | 62%, 63.3% and 62.8%
III. PROPOSED MODEL

The trained model is saved as an h5 file and the emotions from the speech sample are recognized. The architecture of the proposed SER model is given in Fig. 1.

Fig 1: Proposed SER Model
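As a concrete illustration of this architecture, a minimal Keras sketch is given below. Only the 8 convolutional layers, the flatten layer, the soft-max output, the 0.25 dropout value (see Table 2) and the h5 export correspond to details stated in the paper; the filter counts, kernel sizes, pooling and input shape are assumptions made for the example, and build_ser_cnn is a hypothetical helper name.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Dropout, Flatten, Dense

def build_ser_cnn(input_length, num_emotions, dropout=0.25):
    # 8 Conv1D layers in total; filter counts and kernel sizes are assumptions.
    model = Sequential()
    model.add(Conv1D(64, 5, padding="same", activation="relu",
                     input_shape=(input_length, 1)))
    for filters in (64, 128, 128, 128, 256, 256, 256):
        model.add(Conv1D(filters, 5, padding="same", activation="relu"))
        model.add(Dropout(dropout))
    model.add(MaxPooling1D(pool_size=4))
    model.add(Flatten())                                   # flatten layer from the paper
    model.add(Dense(num_emotions, activation="softmax"))   # soft-max emotion output
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# After training, the model can be stored in HDF5 format:
# model.save("ser_model.h5")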
IV. EXPERIMENTAL RESULTS

The model was run on different combinations of the datasets with various parameters such as the dropout value, the feature extractor value, the learning rate and the test size; the combinations with significant results are shown in Table 2. After analysis, we found that the combination of RAVDESS + TESS with the MFCC feature extractor (value 13), dropout 0.25 and test size 0.1 led to the best accuracy.
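A hedged sketch of such a parameter sweep is given below. The dropout values and test sizes mirror those listed in Table 2, while the stratified splitting, epoch count, batch size and the reuse of the illustrative extract_features and build_ser_cnn helpers from the earlier sketches are assumptions rather than the paper's actual training procedure.

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

def run_sweep(X, y, num_emotions):
    # X: rows built with extract_features(); y: integer emotion labels.
    # build_ser_cnn is the illustrative model sketch given above.
    results = {}
    for dropout in (0.25, 0.30, 0.35, 0.40, 0.50):
        for test_size in (0.10, 0.25, 0.30):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=test_size, stratify=y, random_state=42)
            model = build_ser_cnn(X.shape[1], num_emotions, dropout=dropout)
            model.fit(X_tr[..., np.newaxis], to_categorical(y_tr, num_emotions),
                      epochs=50, batch_size=32, verbose=0)
            _, acc = model.evaluate(X_te[..., np.newaxis],
                                    to_categorical(y_te, num_emotions), verbose=0)
            results[(dropout, test_size)] = acc
    return results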
The detailed experimental results are given in Table 2.

TABLE 2. Accuracy of different combinations

Dataset | Feature Extractor | Dropout | Test Size | Accuracy (%)
RAVDESS | MFCC = 13 | 0.25 | 0.25 | 38.71
RAVDESS, TESS | MFCC = 13 | 0.25 | 0.25 | 80.86
RAVDESS, TESS | MFCC = 13 | 0.50 | 0.25 | 71.59
RAVDESS, TESS | MFCC = 13 | 0.35 | 0.25 | 76.79
RAVDESS, TESS | MFCC = 13 | 0.40 | 0.25 | 74.20
RAVDESS, TESS | MFCC = 13 | 0.40 | 0.30 | 74.49
RAVDESS, TESS | MFCC = 13 | 0.25 | 0.10 | 82.25
RAVDESS, TESS | MFCC = 13 | 0.30 | 0.10 | 79.96
RAVDESS, TESS, SAVEE | MFCC = 13 | 0.25 | 0.25 | 77.23
RAVDESS | STFT = 13 | 0.25 | 0.25 | 49.88
RAVDESS, TESS | STFT = 13 | 0.25 | 0.25 | 62.55
RAVDESS, TESS, SAVEE | STFT = 13 | 0.25 | 0.25 | 70.93
RAVDESS | Chroma | 0.25 | 0.25 | 49.88
RAVDESS, TESS | Chroma | 0.25 | 0.25 | 62.55
RAVDESS, TESS, SAVEE | Chroma | 0.25 | 0.25 | 70.93

V. CONCLUSIONS

In this paper, we have proposed a SER model. The proposed model is evaluated on the publicly available speech emotion databases RAVDESS, TESS and SAVEE and on their combinations RAVDESS + TESS and RAVDESS + TESS + SAVEE, using MFCC, Chroma and STFT hand-crafted features. From the experimental results, it is found that the proposed model gives its best result of 82.25% for the RAVDESS + TESS combination of databases using MFCC, and for the RAVDESS + TESS + SAVEE combination the model gives an average accuracy of 77.23% using MFCC features, 70.93% using STFT features and 70.93% using Chroma features. From these results we can conclude that the proposed SER model gives better results using MFCC features in the case of combined databases, whereas in the case of individual databases, RAVDESS with STFT and Chroma features gives 11.17% better results than with MFCC features. In future research, we will continue to work along this line to improve the application of SER and will also try merged combinations of classifiers.

REFERENCES

[1] J. Zhao, X. Mao, and L. Chen, "Speech emotion recognition using deep 1D & 2D CNN LSTM networks," Biomed. Signal Process. Control, vol. 47, pp. 312-323, Jan. 2019.
[2] "Speech emotion recognition using spectrogram & phoneme embedding," in Proc. Interspeech, 2018, pp. 3688-3692.
[3] E. Lakomkin, M. A. Zamani, C. Weber, S. Magg, and S. Wermter, "On the robustness of speech emotion recognition for human-robot interaction with deep neural networks," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 854-860.
[4] S. Sahu, R. Gupta, G. Sivaraman, W. AbdAlmageed, and C. Espy-Wilson, "Adversarial auto-encoders for speech based emotion recognition," 2018, arXiv:1806.02146. [Online]. Available: https://arxiv.org/abs/1806.02146
[5] S. Latif, R. Rana, S. Younis, J. Qadir, and J. Epps, "Transfer learning for improving speech emotion classification accuracy," 2018, arXiv:1801.06353. [Online]. Available: https://arxiv.org/abs/1801.06353
[6] M. Sarma, P. Ghahremani, D. Povey, N. K. Goel, K. K. Sarma, and N. Dehak, "Emotion identification from raw speech signals using DNNs," in Proc. Interspeech, 2018, pp. 3097-3101.
[7] J. Zhao, X. Mao, and L. Chen, "Learning deep features to recognise speech emotion using merged deep CNN," IET Signal Process., vol. 12, no. 6, pp. 713-721, 2018.
[8] W. Zhang, D. Zhao, Z. Chai, L. T. Yang, X. Liu, F. Gong, and S. Yang, "Deep learning and SVM-based emotion recognition from Chinese speech for smart affective services," Softw., Pract. Exper., vol. 47, no. 8, pp. 1127-1138, 2017.
[9] L. Zhu, L. Chen, D. Zhao, J. Zhou, and W. Zhang, "Emotion recognition from Chinese speech for smart affective services using a combination of SVM and DBN," Sensors, vol. 17, no. 7, p. 1694, 2017.