
2021 3rd International Conference on Advances in Computing, Communication Control and Networking (ICACCCN)

CNN based approach for Speech Emotion Recognition Using MFCC, Croma and STFT Hand-crafted features

Nagendra Kumar1, Ratndeep Kaushal1, Shubhi Agarwal1, and Youddha Beer Singh2*
1CSE, Galgotias College of Engineering and Technology, Gautam Buddha Nagar, UP, India
2CSIT, KIET Group of Institutions, Ghaziabad, UP, India
[email protected]*

Abstract— At present, speech emotion recognition (SER) is a very challenging and demanding research area because of its wide range of real-life applications. SER poses two major challenges: the first is to identify a relevant feature vector, and the second is to identify a suitable classifier. To address these challenges, we propose a model for SER. In this approach we first extract the hand-crafted features Mel frequency cepstral coefficients (MFCC), Chroma and Short-term Fourier Transform (STFT) from the emotional speech signals, and these extracted features are used as input to the chosen deep learning classifier, a Convolutional Neural Network (CNN). This work investigates the effectiveness of convolutional neural networks using the hand-crafted features MFCC with 13 coefficients, Chroma with 13 coefficients, and STFT. The proposed model comprises 8 CNN layers; the output of the last layer is fed to a flatten layer, and a soft-max activation function is then applied to identify the emotions. We employed the publicly accessible emotional speech databases RAVDESS, TESS, and SAVEE, as well as their combinations, to evaluate the performance of our proposed model. Experiments show that, in terms of average accuracy, the proposed model outperforms current state-of-the-art SER techniques.

Keywords— Speech Emotion Recognition, MFCC, Chroma, STFT, Convolution Neural Network.

I. INTRODUCTION

Speech is one of the most prominent modes of communication between humans and can be a feasible way to interact with computers as well. Human-computer interaction can become more personalized and interactive if computers begin to predict the emotional state of the interacting speaker. Speech is one of the natural ways for humans to express their emotions. Additionally, speech is easy to obtain and hence to process in real-time scenarios. This is why, to make machines more humanoid, identifying emotions from human speech becomes important. The use of speech signals to recognize emotions is an important as well as challenging field of Human-Computer Interaction. SER has many applications, such as assisting human-machine interaction, healthcare and security.

In this paper we use different combinations of three datasets (TESS, SAVEE, RAVDESS) and three hand-crafted features: Mel frequency cepstral coefficients (MFCC), Chroma and Short-term Fourier Transform (STFT). With the help of these feature extractors, we extract the features from the speech signals, and the extracted features are used as input to the deep learning-based classifier, a Convolutional Neural Network (CNN). The contributions of this paper are given below:
● We propose a CNN-based SER model using MFCC, Chroma, and STFT hand-crafted features.
● The proposed SER model is evaluated on the RAVDESS database and on the combined RAVDESS + TESS and RAVDESS + TESS + SAVEE databases.
The rest of the paper is organized as follows: related work is discussed in Section 2, the proposed model is described in Section 3, detailed experimental results are given in Section 4, and the conclusion is presented in Section 5.

II. RELATED WORK

Studies mention the advantage of hand-crafted features for audio data. However, several of them focus on single-mode emotion recognition, limiting the development of the models as a whole. Multimodal models utilise a CNN or RNN as a trainable feature extractor and do not consider both the temporal and global information of the speech at the same time, often overlooking the temporal properties.

In one study, two convolutional long short-term memory networks (CNN LSTM), one a 1-dimensional CNN LSTM network and the other a 2-dimensional CNN LSTM network, were employed. Both networks have the same architecture, consisting of one long short-term memory (LSTM) layer and 4 local feature learning blocks (LFBs), and recognize speech emotions such as happiness, surprise, disgust, neutral, sadness, fear and anger. The model showed a wide range of accuracy figures that vary over the databases and methods used [1]. Another approach, based on phoneme sequences and spectrograms, has an advantage over converting speech into text because it retains the emotion content that text otherwise loses. A combined phoneme- and spectrogram-based convolutional model was tested and found to be the most effective in recognizing human emotions on the IEMOCAP data set [2].


Some research is based on evaluating the stability of neural acoustic emotion recognition models: after conducting numerous trials on the iCub robot platform, intended to narrow the gap between the performance of the model during training and during testing in a real-world setting, the model was able to predict the emotions anger, happiness, neutral and sadness with an accuracy of around 83.2% [3].

With the increasing popularity of auto-encoders in machine learning, researchers have investigated adversarial auto-encoders for emotion recognition on the basis of two aspects: one is their capacity to encode high-dimensional feature vectors into a compressed set, and the other is their potential to regenerate synthetic samples of the original dataset, which may then be utilised for purposes such as training emotion detection classifiers [4]. Existing SER work has focused on automatic emotion detection with training and testing sets drawn from the same corpus, so models were not trained for the mismatched conditions that cause a significant performance drop in cross-corpus and cross-language scenarios. To address this issue, five different approaches to cross-corpus emotion recognition were evaluated against a Sparse Autoencoder and SVM baseline system [5]. Another approach merged two CNNs, one with 1D CNN layers and the other with 2D CNN layers, and applied transfer learning to expedite the training of the merged CNN: the 1D CNN and 2D CNN were first trained, the learned features were then fed to the merged CNN, and the merged DCNN was finally trained. The results show that the merged DCNN improves speech emotion recognition performance [7]. It has also been observed that a study and comparison of different feature extraction methods produced strong results and paved the way for real-time emotion detection over the duration of an utterance [6]. Using the Chinese Academy of Sciences emotional speech database, the relationship between emotion recognition performance and feature fusion was examined with two methods, namely deep belief networks and support vector machines, abbreviated as DBN and SVM respectively. Using an SVM multi-classification algorithm with optimization of the penalty factor and kernel function parameters led to an accuracy of 84.54%, while DBNs led to a mean accuracy of 94.6% on both gender-independent and gender-dependent experiments [8]. A more effective method was created in which SVMs and DBNs were used together and combined through novel classification methods instead of being used individually. Experiments were gender-dependent, and five types of features were extracted, including MFCC, short-term zero-crossing rate and energy, pitch and formants. This approach achieved an accuracy of 95.8% [9]. Speech emotion recognition has also been addressed with a frame-based framework that works end-to-end with deep learning and minimal speech processing to model intra-utterance dynamics. The model was based on different variants of recurrent neural network and feed-forward architectures, and the experiments highlighted the pros and cons of the prepared models for emotion recognition and paralinguistic speech recognition [10].

Unlike traditional SER approaches, attention has also shifted to emotion-discriminative information and to the feature distribution differences between training and testing datasets. An Emotion-discriminative and Domain-invariant Feature Learning Method, in short EDFLM, was proposed based on Domain Adaptation (DA) methods, considering both domain-invariant and emotion-discriminative features by employing a domain label constraint and an emotion label constraint. The model separated emotion-related and emotion-unrelated features and used a back-propagation network on acoustic features [11]. The problem of poor generalization of emotion classifiers was also addressed: a novel unsupervised domain adaptation model, namely Universum autoencoders, was proposed, which evaluated and enhanced the system's performance when the conditions of the training and testing sets were not the same [12]. A multi-task DNN with shared hidden layers (MT-SHL-DNN) was proposed, in which the feature transformations are shared across the different emotion representations while the output layers are separately associated with each database; the model was designed to manage the scarcity of annotated acoustic data [13]. In an earlier study, a CNN model was proposed that took spectrograms generated from the speech signals as input; the model consisted of 3 convolutional layers and 3 fully connected layers that extracted discriminative features from the spectrogram images and returned predictions over 7 different emotions [14]. The summary of the literature review is given in Table 1.

Table 1: Summarised Literature Review for SER

Ref. | Database | Approach | Accuracy
[1] | EmoDB, IEMOCAP | 1D CNN LSTM, 2D CNN LSTM | 40.02% to 95.89%
[2] | IEMOCAP | 2D CNN | 4% higher compared to the existing
[3] | IEMOCAP | RNN + CNN | 83.2%
[4] | IEMOCAP | Adversarial Auto-encoders (AAE) | 57.88%
[5] | FAU-AIBO, EmoDB, IEMOCAP, EMOVO, SAVEE | DBM based on RBM | Comparatively higher accuracy than existing
[6] | IEMOCAP | TDNN-LSTM | 70.6%
[7] | IEMOCAP and EmoDB | DCNN | 92.71%
[8] | CAS emotional speech database | DBN and SVM | 94.6%, 84.54%
[9] | Chinese Academy of Sciences emotional speech database | SVM + DBN | 94.6%
[10] | IEMOCAP | RNN + CNN | 64.78%
[11] | INTERSPEECH 2009 Emotional Challenge, EmoDB and ABC databases | EDFLM (Emotion-Discriminative and Domain-Invariant Feature Learning Method) | 61.63%
[12] | EmoDB, ABC and Geneva Whispered Emotion Corpus | Unsupervised Universum Autoencoders Adaptive Model | 62%, 63.3% and 62.8%


III. PROPOSED WORK

This paper proposes a model for SER. In this approach we first extract the hand-crafted features Mel frequency cepstral coefficients (MFCC), Chroma and Short-term Fourier Transform (STFT) from the emotional speech signals, and these extracted features are used as input to the deep learning classifier, a Convolutional Neural Network (CNN). This work investigates the effectiveness of convolutional neural networks using the hand-crafted features MFCC with 13 coefficients, Chroma with 13 coefficients, and STFT. The proposed model consists of 8 CNN layers; the output of the last layer is fed to a flatten layer, and a soft-max activation function is then applied to identify the emotions. To test the performance of our proposed model, we examined it on the publicly available databases RAVDESS, TESS and SAVEE. The details of these methods are discussed in the following sections.

A. Datasets
In this work we used three datasets, RAVDESS, TESS and SAVEE, and their different combinations.

RAVDESS: The Ryerson Audio-Visual Database of Emotional Speech and Song contains seven emotion categories, i.e., sad, anger, calm, happy, fearful, surprise, and disgust expressions, plus a neutral expression; each expression is recorded at two levels of intensity, i.e., normal and strong. The speech utterances were recorded at a 48 kHz sampling rate with 16-bit resolution. The whole database comprises 1440 files from 24 professional actors (12 male, 12 female). The average duration of the audio files is 3 seconds.

TESS: The Toronto Emotional Speech Set is a very high-quality, female-only audio corpus comprising 200 target words spoken by two actresses (aged 26 and 64) in the carrier phrase "Say the word _". The data set incorporates seven emotions, i.e., pleasant surprise, disgust, sadness, fear, anger, happiness, and neutral, for a total of 2800 data points. The dataset is organised such that each of the two actresses and their emotions are contained within their own folder, and each of the 200 target-word audio files can be found within it. The audio file format is WAV.

SAVEE: The Surrey Audio-Visual Expressed Emotion database was recorded for the development of an automatic emotion recognition system; it consists of 480 British English utterances by 4 male actors covering 7 distinct emotions. The data were recorded, processed and labelled in a visual media lab with the assistance of high-quality audio-visual hardware. TIMIT-based, phonetically balanced sentences were chosen for every emotion. The recordings were evaluated by 10 subjects under audio, visual and audio-visual conditions to test the quality of the performances.
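The paper does not describe how the audio files are read or how emotion labels are assigned. As one concrete illustration, the short sketch below shows how labels could be recovered from RAVDESS file names, whose third hyphen-separated field encodes the emotion; this is a documented convention of the dataset, not something specified in this paper, and the directory layout and helper names are hypothetical.

```python
import os
import glob

# RAVDESS filename convention (documented by the dataset, assumed here):
# e.g. "03-01-06-01-02-01-12.wav" -> third field "06" is the emotion code.
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def ravdess_label(path: str) -> str:
    """Return the emotion label encoded in a RAVDESS file name."""
    code = os.path.basename(path).split("-")[2]
    return RAVDESS_EMOTIONS[code]

# Hypothetical directory layout; TESS and SAVEE encode the emotion directly
# in folder or file names and can be handled analogously.
files = glob.glob("data/ravdess/Actor_*/*.wav")
labels = [ravdess_label(f) for f in files]
```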
B. Feature Extraction
Three feature extraction methods are investigated in this paper: Mel frequency cepstral coefficients (MFCC), Chroma and the Short-term Fourier Transform (STFT), with MFCC using 13 coefficients and Chroma using 13 coefficients. These three feature extractors were chosen on the basis of past research. The time-domain speech files are windowed using a Hanning window with a 20 ms window size and 10 ms overlap, and a magnitude spectrum, i.e., a time-frequency representation of the speech, is generated as a spectrogram. Zero padding helps construct spectrograms of identical size, which is necessary for CNN input. The Mel spectrogram differs from the magnitude spectrogram in its filter bank, which replicates the human ear by concentrating more on lower-frequency regions than on upper frequencies. The following formula is used to convert a frequency in Hertz to the Mel scale:

Mel(f) = 2595 log10(1 + f / 700)    (1)

Chroma is often considered important for high-level semantic analysis; in high-level tasks, the chroma feature enables far better results. Short-time Fourier Transforms and the Constant-Q Transform are used to retrieve chroma features. The Short-Time Fourier Transform (STFT) can be viewed as a sequence of Fourier transforms: for signals that change over time, it gives time-localized frequency information. STFTs are computed by dividing an extended temporal signal into smaller segments of equal length and performing the Fourier transform on each segment in the same way.
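The paper does not name a feature-extraction toolkit. The sketch below shows one way the MFCC (13 coefficients), chroma and STFT features described above could be computed with librosa, using the 20 ms window and 10 ms hop stated in the text; averaging each feature over time to obtain a fixed-length vector is an assumption, not something the paper specifies.

```python
import numpy as np
import librosa

def hz_to_mel(f_hz):
    # Eq. (1): Mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def extract_features(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)      # keep the file's native sampling rate
    n_fft = int(0.020 * sr)                  # 20 ms analysis window
    hop = int(0.010 * sr)                    # 10 ms hop, i.e. 10 ms overlap between windows

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    # librosa's chroma_stft defaults to 12 pitch classes; the paper reports a
    # 13-dimensional chroma feature, so the exact configuration is unclear.
    chroma = librosa.feature.chroma_stft(y=y, sr=sr,
                                         n_fft=n_fft, hop_length=hop)
    stft_mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop,
                                   window="hann"))

    # Assumption: average each feature over time to get one vector per file.
    return np.concatenate([mfcc.mean(axis=1),
                           chroma.mean(axis=1),
                           stft_mag.mean(axis=1)])
```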
C. Convolutional neural network
The architecture of a CNN is made up of three parts. The first is the convolutional layers, which contain a number of filters applied to the input; within a convolutional layer, each filter scans the input, applying a scalar product and summation to produce a set of feature maps. The pooling layer is the second important component; there are several approaches for reducing dimensionality, including min pooling, max pooling, average pooling, mean pooling, and so on. The final component is the flatten and fully connected (FC) layers, which aggregate the extracted features and feed them to a SoftMax classifier to obtain the probability of each class. The proposed model comprises 8 CNN layers with varying dropout; the output of the last layer is fed to a flatten layer, and a soft-max activation function is then applied to identify the emotions. The network is used with spectrogram and Mel spectrogram inputs with initial dimensions of 3540 × 216. The kernel of the first CNN layer has dimensions 8 × 8 with 216 × 256 features. Max pooling with dimensions 8 × 8 is used to shrink the feature maps, reducing the number of attributes used for training. Fully connected flatten layers combined with a SoftMax layer are used to obtain the class probabilities for every spectrogram image. To ensure the higher layers are normalized after the convolution layers, we used batch normalization, which improves the stability and performance of deep networks and also speeds up training. Additionally, dropout (0.25) is used to mitigate overfitting.
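The paper specifies 8 convolutional layers, an 8 × 8 kernel in the first layer, 8 × 8 max pooling, batch normalization, dropout of 0.25 and a soft-max output, but not the complete layer-by-layer configuration. The Keras sketch below is therefore only one plausible realisation under those constraints; the filter counts, the grouping into blocks, the pooling sizes after the first block and the exact input handling are assumptions.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 8                      # e.g. the eight RAVDESS emotion classes
INPUT_SHAPE = (3540, 216, 1)         # spectrogram size stated in the paper; adjust as needed

def build_ser_cnn():
    model = models.Sequential()
    model.add(layers.Input(shape=INPUT_SHAPE))

    # 8 convolutional layers arranged as 4 blocks of 2 (filter counts assumed).
    for i, filters in enumerate([64, 128, 256, 256]):
        first_kernel = (8, 8) if i == 0 else (3, 3)   # 8x8 kernel in the first layer, per the paper
        model.add(layers.Conv2D(filters, first_kernel, padding="same", activation="relu"))
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.BatchNormalization())
        pool = (8, 8) if i == 0 else (2, 2)           # 8x8 pooling per the paper; later pools assumed
        model.add(layers.MaxPooling2D(pool))
        model.add(layers.Dropout(0.25))               # dropout value used in the paper

    model.add(layers.Flatten())
    model.add(layers.Dense(NUM_CLASSES, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Grouping the eight convolutional layers into blocks that each end with batch normalization, pooling and dropout mirrors the regularisation described above, but other groupings are equally consistent with the text; model.summary() can be used to confirm the stack before the flatten and soft-max stages.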
D. Architecture of the proposed model
In this model we use the datasets RAVDESS, TESS and SAVEE both individually and in different combinations, i.e., RAVDESS + TESS + SAVEE, RAVDESS + TESS, TESS + SAVEE and RAVDESS + SAVEE. One of the feature extractors among MFCC, STFT and Chroma is selected, the model is trained with a deep learning approach, namely the CNN, and the model accuracy is tested. The model is run on the different individual and combined databases against each feature extractor, and the configuration with the higher accuracy is chosen.


The trained model is then saved as an .h5 file, and the emotions in the speech samples are recognized. The architecture of the proposed SER model is given in Fig. 1.

Fig. 1: Proposed SER Model
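No training script is given in the paper, which reports only the varied settings (for example dropout 0.25 with a test size of 0.1 for the best RAVDESS + TESS result in Section IV) and the fact that the trained model is saved as an .h5 file. The minimal end-to-end sketch below reuses build_ser_cnn from the sketch above; the use of scikit-learn's train_test_split, the label encoding, the placeholder data, the number of epochs and the batch size are all assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# Placeholder data standing in for the real zero-padded spectrograms and
# emotion labels prepared as described above (shapes only, values random).
rng = np.random.default_rng(0)
X = rng.random((16, 3540, 216, 1), dtype=np.float32)
labels = rng.choice(["angry", "calm", "disgust", "fearful",
                     "happy", "neutral", "sad", "surprised"], size=16)
y = to_categorical(LabelEncoder().fit_transform(labels), num_classes=8)

# Hold out 10% of the data for testing, matching the best-performing
# setting reported in Table 2 (dropout 0.25, test size 0.1).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0)

model = build_ser_cnn()                      # CNN sketch from the previous block
model.fit(X_train, y_train, validation_data=(X_test, y_test),
          epochs=1, batch_size=2)            # epochs/batch size are illustrative only

_, acc = model.evaluate(X_test, y_test)
model.save("ser_cnn.h5")                     # saved as an .h5 file, as in the paper
print(f"Test accuracy: {acc:.4f}")
```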

IV. EXPERIMENTAL RESULTS

The model was run on different combinations of the data sets with various parameters, such as the dropout value, the feature extractor, the learning rate and the test size. Some of the configurations with significant results are shown in Table 2. After analysis, we found that the combination of RAVDESS + TESS with the MFCC feature extractor (13 coefficients), dropout 0.25 and test size 0.1 led to the best accuracy. The detailed experimental results are given in Table 2.

TABLE 2. Accuracy of different combinations

Dataset | Feature Extractor | Dropout | Test Size | Accuracy (%)
RAVDESS | MFCC = 13 | 0.25 | 0.25 | 38.71
RAVDESS, TESS | MFCC = 13 | 0.25 | 0.25 | 80.86
RAVDESS, TESS | MFCC = 13 | 0.50 | 0.25 | 71.59
RAVDESS, TESS | MFCC = 13 | 0.35 | 0.25 | 76.79
RAVDESS, TESS | MFCC = 13 | 0.40 | 0.25 | 74.20
RAVDESS, TESS | MFCC = 13 | 0.40 | 0.30 | 74.49
RAVDESS, TESS | MFCC = 13 | 0.25 | 0.10 | 82.25
RAVDESS, TESS | MFCC = 13 | 0.30 | 0.10 | 79.96
RAVDESS, TESS, SAVEE | MFCC = 13 | 0.25 | 0.25 | 77.23
RAVDESS | STFT = 13 | 0.25 | 0.25 | 49.88
RAVDESS, TESS | STFT = 13 | 0.25 | 0.25 | 62.55
RAVDESS, TESS, SAVEE | STFT = 13 | 0.25 | 0.25 | 70.93
RAVDESS | Chroma | 0.25 | 0.25 | 49.88
RAVDESS, TESS | Chroma | 0.25 | 0.25 | 62.55
RAVDESS, TESS, SAVEE | Chroma | 0.25 | 0.25 | 70.93

V. CONCLUSIONS

In this paper we have proposed a SER model. The proposed model is evaluated on the publicly available emotional speech databases RAVDESS, TESS and SAVEE and on their combinations RAVDESS + TESS and RAVDESS + TESS + SAVEE, using MFCC, Chroma, and STFT hand-crafted features. From the experimental results, it is found that the proposed model gives its best result of 82.25% for the RAVDESS + TESS combination of databases using MFCC, and for the RAVDESS + TESS + SAVEE combination the model gives an average accuracy of 77.23% using MFCC features, 70.93% using STFT features and 70.93% using Chroma features. From these results we can conclude that the proposed SER model gives better results using MFCC features in the case of combined databases, while in the case of individual databases, RAVDESS with STFT and Chroma features gives 11.17% better results than with MFCC features. In future research, we will continue to work along this line to improve the application of SER and also try merged combinations of classifiers.

REFERENCES

[1] J. Zhao, X. Mao, and L. Chen, "Speech emotion recognition using deep 1D & 2D CNN LSTM networks," Biomed. Signal Process. Control, vol. 47, pp. 312–323, Jan. 2019.
[2] P. Yenigalla, A. Kumar, S. Tripathi, C. Singh, S. Kar, and J. Vepa, "Speech emotion recognition using spectrogram & phoneme embedding," in Proc. Interspeech, 2018, pp. 3688–3692.
[3] E. Lakomkin, M. A. Zamani, C. Weber, S. Magg, and S. Wermter, "On the robustness of speech emotion recognition for human-robot interaction with deep neural networks," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 854–860.
[4] S. Sahu, R. Gupta, G. Sivaraman, W. AbdAlmageed, and C. Espy-Wilson, "Adversarial auto-encoders for speech based emotion recognition," 2018, arXiv:1806.02146. [Online]. Available: https://arxiv.org/abs/1806.02146
[5] S. Latif, R. Rana, S. Younis, J. Qadir, and J. Epps, "Transfer learning for improving speech emotion classification accuracy," 2018, arXiv:1801.06353. [Online]. Available: https://arxiv.org/abs/1801.06353
[6] M. Sarma, P. Ghahremani, D. Povey, N. K. Goel, K. K. Sarma, and N. Dehak, "Emotion identification from raw speech signals using DNNs," in Proc. Interspeech, 2018, pp. 3097–3101.
[7] J. Zhao, X. Mao, and L. Chen, "Learning deep features to recognise speech emotion using merged deep CNN," IET Signal Process., vol. 12, no. 6, pp. 713–721, 2018.
[8] W. Zhang, D. Zhao, Z. Chai, L. T. Yang, X. Liu, F. Gong, and S. Yang, "Deep learning and SVM-based emotion recognition from Chinese speech for smart affective services," Softw., Pract. Exper., vol. 47, no. 8, pp. 1127–1138, 2017.
[9] L. Zhu, L. Chen, D. Zhao, J. Zhou, and W. Zhang, "Emotion recognition from Chinese speech for smart affective services using a combination of SVM and DBN," Sensors, vol. 17, no. 7, p. 1694, 2017.


[10] H. M. Fayek, M. Lech, and L. Cavedon, "Evaluating deep learning architectures for speech emotion recognition," Neural Netw., vol. 92, pp. 60–68, Aug. 2017.
[11] Q. Mao, G. Xu, W. Xue, J. Gou, and Y. Zhan, "Learning emotion discriminative and domain-invariant features for domain adaptation in speech emotion recognition," Speech Commun., vol. 93, pp. 1–10, Oct. 2017.
[12] J. Deng, X. Xu, Z. Zhang, S. Frühholz, and B. Schuller, "Universum autoencoder-based domain adaptation for speech emotion recognition," IEEE Signal Process. Lett., vol. 24, no. 4, pp. 500–504, Apr. 2017.
[13] Y. Zhang, Y. Liu, F. Weninger, and B. Schuller, "Multi-task deep neural network with shared hidden layers: Breaking down the wall between emotion representations," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 4990–4994.
[14] A. M. Badshah, J. Ahmad, N. Rahim, and S. W. Baik, "Speech emotion recognition from spectrograms with deep convolutional neural network," in Proc. IEEE Int. Conf. Platform Technol. Service (PlatCon), Feb. 2017, pp. 1–5.
