SPEECH EMOTION RECOGNITION USING MACHINE LEARNING
A PROJECT REPORT
Submitted by
PERINBAN D (211417205113)
BALAJI M (211417205025)
GOPINATH D (211417205054)
HARIHARAN S J(211417205055)
in partial fulfilment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
AUGUST 2021
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE

Certified that this project report titled "SPEECH EMOTION RECOGNITION USING MACHINE LEARNING" is the bonafide work of PERINBAN.D (211417205113), BALAJI.M (211417205025), GOPINATH.D (211417205054) and HARIHARAN.S.J (211417205055), who carried out the project work under my supervision.

SIGNATURE                                          SIGNATURE
Dr. M. HELDA MERCY, M.E., Ph.D.,                   Ms. S. KUMARI, M.E.,

Submitted for the Project and Viva Voce Examination held on 7-8-2021

SIGNATURE                                          SIGNATURE
DECLARATION
(PERINBAN D)
(BALAJI M)
(GOPINATH D)
(HARIHARAN S J)
It is certified that this project has been prepared and submitted under my
guidance.
TABLE OF CONTENTS

CHAPTER NO    TITLE                                        PAGE NO

              ABSTRACT                                     vii
              LIST OF TABLES                               viii
              LIST OF FIGURES                              ix
              LIST OF ABBREVIATIONS                        x
1             INTRODUCTION
2             LITERATURE SURVEY
              2.5 FEASIBILITY STUDY                        10
3             SYSTEM DESIGN
              3.1 PROPOSED SYSTEM ARCHITECTURE DESIGN      14
4             MODULE DESIGN
              4.1 SPEECH PROCESSING MODULE                 34
              4.2 PRE-PROCESSING MODULE                    34
              4.3 FEATURE EXTRACTION MODULE                35
              4.4 CLASSIFIER MODULE                        36
              4.5 EMOTION DETECTION MODULE                 37
5             REQUIREMENT SPECIFICATION
              5.2 SOFTWARE REQUIREMENT                     40
6             IMPLEMENTATION
7             TESTING AND MAINTENANCE
              7.1 TESTING                                  82
              7.1.1 System Testing                         83
              7.2 TEST CASES                               84
              7.3 TEST DATA AND OUTPUT
              7.3.1 Unit Testing                           85
              7.3.3 Integration Testing                    86
              7.5 MAINTENANCE                              91
8             CONCLUSION AND FUTURE ENHANCEMENT            93
              REFERENCES                                   96
ABSTRACT
Speech is one of the most natural ways for human beings to express themselves. We also resort to other communication forms such as emails and messages, where we often use emojis and expressive fonts to convey the emotions associated with the text. Emotions play a vital role in human communication, so detecting them automatically is valuable. This work processes and classifies speech signals to detect the emotions embedded in them. Such a system can find use in a wide variety of application areas, such as interactive voice response systems that analyse the audio data of recordings. Emotion is an integral part of human behaviour and an inherent property of every mode of communication. We humans are well trained, through experience, to understand content-based information such as the information in text, audio or video, but machines need to be explicitly trained to do the same.
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ACRONYMS MEANING
TC TEST CASE
GB GIGABYTE
MB MEGABYTE
CHAPTER 1
INTRODUCTION
Speech emotion recognition is a challenging task, and extensive reliance has been
placed on models that use audio features in building well-performing classifiers. In this
project, we propose a novel deep dual recurrent encoder model that utilizes text data and
audio signals simultaneously to obtain a better understanding of speech data. As
emotional dialogue is composed of sound and spoken content, our model encodes the
information from audio and text sequences using dual recurrent neural networks (RNNs)
and then combines the information from these sources to predict the emotion class. This
architecture analyzes speech data from the signal level to the language level, and it thus
utilizes the information within the data more comprehensively than models that focus on
audio features. Extensive experiments are conducted to investigate the efficacy and
properties of the proposed model. Our proposed model outperforms previous state-of-
the-art methods in assigning data to one of four emotion categories (i.e., angry, happy,
sad and neutral).
The categorization of emotions has long been a hot subject of debate in different
fields of psychology, affective science, and emotion research. It is mainly based on two
popular approaches: categorical (termed discrete) and dimensional (termed continuous).
In the first approach, emotions are described with a discrete number of classes. Many
theorists have conducted studies to determine which emotions are basic. The most popular
example is a list of six basic emotions, which are anger, disgust, fear, happiness, sadness,
and surprise.
1.2 Need Of The Project
Communication is the key to express oneself. Humans use most part of their
body and voice to effectively communicate. Hand gestures, body language, and the
tone and temperament are all collectively used to express one’s feeling. Though the
verbal part of the communication varies by languages practiced across the globe, the
non-verbal part of communication is the expression of feeling which is most likely
common among all. Therefore, any advanced technology developed to provide a social, conversational experience must also be able to understand the emotional context of speech.
To overcome these problems, recognising the emotion conveyed in speech is necessary.
In developing emotionally aware intelligence, the very first step is building robust
emotion classifiers that display good performance regardless of the application; this
outcome is considered to be one of the fundamental research goals in affective
computing . In particular, the speech emotion recognition task is one of the most
important problems in the field of paralinguistics. This field has recently broadened
its applications, as it is a crucial factor in optimal humancomputer interactions,
including dialog systems. The goal of speech emotion recognition is to predict the
emotional content of speech and to classify speech according to one of several
labels (i.e., happy, sad, neutral, and angry). This task remains difficult for two main reasons. First, insufficient data are available for training complex neural-network-based models, due to the costs associated with human annotation. Second, the characteristics of emotions must be learned
from low-level speech signals. Feature-based models display limited skills when
applied to this problem. To overcome these limitations, we propose a model that
uses high-level text transcription, as well as low-level audio signals, to utilize the
information contained within low-resource datasets to a greater degree. Given recent
improvements in automatic speech recognition (ASR) technology, such transcriptions can be obtained automatically with reasonable accuracy.
1.3 Objective Of The Project
There are three classes of features in a speech namely, the lexical features (the
vocabulary used), the visual features (the expressions the speaker makes) and the
acoustic features (sound properties like pitch, tone, jitter, etc.).
The problem of speech emotion recognition can be solved by analysing one or more
of these features. Choosing to follow the lexical features would require a transcript of
the speech which would further require an additional step of text extraction from
speech if one wants to predict emotions from real-time audio. Similarly, analysing visual features would require access to video of the conversations, which might not be feasible in every case. Analysis of the acoustic features, by contrast, can be done in real time while the conversation is taking place, as we only need the audio data to accomplish our task. Hence, we choose to analyse the acoustic features in this work.
Speaker Identification
Speech Recognition
Speech Emotion Detection
1.4 Scope Of The Project
CHAPTER 2
LITERATURE SURVEY
Their system uses ranking SVM classifiers (also called rankers) and applies them to the final multi-class problem. Using the ranking SVM algorithm, an accuracy of 44.40% was achieved in their system.
In the Nwe et al. [9] system, a subset of features, similar to the Mel Frequency
Cepstral Coefficients (MFCC), was used. They used the Log Frequency Power
Coefficients (LFPC) over a Hidden Markov Model (HMM) to classify emotions
in speech. Their work is not publicly available, as they used a dataset privately available to them. However, they claim that using the LFPC coefficients instead of the MFCC coefficients shows a significant improvement in the accuracy
of the model. The average classification accuracy in their model is 69%.
dataset (Devillers et al., 2002). A balanced dataset was used for testing but not
for training and lexical cues were included in the analysis. A recognition rate of
59.8% was achieved for four emotions in the CEMO corpus (Devillers and
Vasilescu, 2006), with lexical cues again included in the analysis. Using the
“How May I Help You” dataset and four groups of features – lexical, prosodic,
dialog-act, and contextual – the recognition rate achieved for seven emotions was 79%. However, 73.1% of the instances were labeled as non-negative in the
dataset, producing a recognition baseline of 73.1% for random guessing
(Liscombe et al., 2005).
CHAPTER 3
SYSTEM DESIGN
3.1 PROPOSED SYSTEM ARCHITECTURE
A data flow diagram (DFD) shows what kind of information will be input to and output from the system, where the data will come from and go to, and where it will be stored. It does not show information about whether processes will operate in sequence or in parallel, unlike a flowchart, which does show this information.
Data flow diagrams are also known as bubble charts. DFD is a designing
tool used in the top-down approach to Systems Design. This context-level DFD
is next "exploded", to produce a Level 1 DFD that shows some of the detail of
the system being modeled. The Level 1 DFD shows how the system is divided
into sub-systems (processes), each of which deals with one or more of the data
flows to or from an external agent, and which together provide all of the
functionality of the system as a whole. It also identifies internal data stores that
must be present in order for the system to do its job, and shows the flow of data
between the various parts of the system.
Data flow diagrams are one of the three essential perspectives of the structured-
systems analysis and design method SSADM. The sponsor of a project and the
end users will need to be briefed and consulted throughout all stages of a
system's evolution. With a data flow diagram, users are able to visualize how
the system will operate, what the system will accomplish, and how the system
will be implemented.
LEVEL 0 DFD: User, Gender input
LEVEL 1 DFD: Pre-processing
LEVEL 2 DFD: Feature Extraction, Datasets, Graphical Result
describes a sequence of actions that provides something of measurable value to an actor and is drawn as a horizontal ellipse.
We can also use the terms event diagrams or event scenarios to refer to
a sequence diagram. Sequence diagrams describe how and in what order the
objects in a system function.
operations of a class and also the constraints imposed on the system. The class
diagrams are widely used in the modeling of object-oriented systems because
they are the only UML diagrams, which can be mapped directly with object-
oriented languages.
exchange among the objects within the collaboration to achieve a desired
outcome.
diagram portrays the control flow from a start point to a finish point showing
the various decision paths that exist while the activity is being executed.
Fig 3.8 Block Diagram
SER is essentially a pattern recognition system, so the stages that are present in a pattern recognition system are also present in a speech emotion recognition system. The speech emotion recognition system contains five main modules: emotional speech input, feature extraction, feature selection, classification, and recognized emotional output [2].
The stages of a pattern recognition system are likewise present in a speech emotion recognition system, which makes the two analogous [22]. Patterns of derived speech features such as energy, MFCC and pitch are mapped to emotions using various classifiers. The system consists of five main modules:
Speech input: The input to the system is speech captured with the help of a microphone. The PC sound card then produces an equivalent digital representation of the received audio.
Feature extraction and selection: A typical set contains around 300 emotional states, and emotional relevance is used to select the extracted speech features. The whole procedure, from speech feature extraction to feature selection, revolves around the speech signal.
Classification: Finding a set of significant emotions for classification is the main concern in a speech emotion recognition system. A typical set of emotions contains around 300 emotional states, which makes classification a complicated task.
Recognized emotional output: Fear, surprise, anger, joy, disgust and sadness are the primary emotions, and the level of naturalness of the database is the basis for evaluating a speech emotion recognition system.
A typical set of emotions contains around 300 emotional states, so classifying such a great number of emotions is very complicated. According to the 'palette theory', any emotion can be decomposed into primary emotions, similar to the way that any colour is a combination of a few basic colours. The primary emotions are anger, disgust, fear, joy, sadness and surprise. The evaluation of a speech emotion recognition system is based on the level of naturalness of the database used as its input. If an inferior database is used as input, incorrect conclusions may be drawn. The database given to the speech emotion recognition system may contain real-world emotions or acted ones; it is more practical to use a database collected from real-life situations.
LIST OF MODULES:
1. Voice Input
In this module, the user has to speak into the microphone after pressing the Speak button. The system then starts receiving the user's voice.
2. Voice to Text
In the second module, after receiving the voice, MFCC, LPCC and PLP feature extraction is performed on the signal to check the normal audible frequencies. The voice is then converted to text with the help of the Google Speech-to-Text API.
3. Analyzing the Extracted Text
In the third module, the results of the previous module, i.e. the converted texts, are analyzed against the customized datasets.
4. Graphical Result
In the final module, after comparing the text with the datasets, a graph-based result is displayed showing whether the emotion is anger, happiness, neutral, etc.
TYPES OF SPEECH:
Speech recognition systems can be separated into different classes on the basis of the type of speech they are able to recognize. The classes are as follows:
Isolated words: In this type of recognizer, both sides of the sample window contain quiet, low-level audio. Only a single word or utterance is accepted at a time, and the speaker needs to wait between utterances, as these systems have listen/non-listen states. 'Isolated utterance' is a better name for this class.
Connected words: Similar to isolated words, except that separate utterances can run together with only a minimal pause between them.
Continuous words: These allow users to speak naturally, and the content is determined by the computer. Creating recognizers with continuous speech capabilities is difficult, because utterance boundaries have to be determined using special methods.
Spontaneous words: This can be thought of as speech at a basic level that is natural sounding and not rehearsed. An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features.
The features should be informative to the context. Only those features that are
more descriptive about the emotional content are to be selected for further
analysis.
The features should be consistent across all data samples. Features that are
unique and specific to certain data samples should be avoided.
The values of the features should be processed. The initial feature selection
process can result in a raw feature vector that is unmanageable. The process of
Feature Engineering will remove any outliers, missing values, and null values.
The features in a speech percept that is relevant to the emotional content can be
grouped into two main categories:
1. Prosodic features
2. Phonetic features.
The prosodic features are the energy, pitch, tempo, loudness, formant, and
intensity. The phonetic features are mostly related to the pronunciation of the
words based on the language. Therefore for the purpose of emotion detection,
the analysis is performed on the prosodic features or a combination of them.
Mostly the pitch and loudness are the features that are very relevant to the
emotional content.
To extract speech information from audio signals, we use MFCC values, which
are widely used in analyzing audio signals. The MFCC feature set contains a
total of 39 features, which include 12 MFCC parameters (1-12) from the 26 Mel-frequency bands plus a log-energy parameter, together with 13 delta and 13 acceleration coefficients. The frame size is set to 25 ms at a rate of 10 ms with the Hamming
function. According to the length of each wave file, the sequential step of the
MFCC features is varied. To extract additional information from the data, we
also use prosodic features, which show effectiveness in affective computing.
The prosodic features are composed of 35 features, which include the F0
frequency, the voicing probability, and the loudness contours. All of these
MFCC and prosodic features are extracted from the data using the OpenSMILE
toolkit.
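As a concrete sketch of this step, the snippet below computes a 39-dimensional MFCC-based representation (13 cepstral coefficients plus deltas and accelerations) with 25 ms frames and a 10 ms hop. It uses librosa purely for illustration; the report itself extracts these features with the openSMILE toolkit, and the 13-coefficient split differs slightly from the 12 + log-energy layout described above.

import librosa
import numpy as np

def mfcc_39(path, sr=16000):
    # Load audio, then compute 13 MFCCs over 25 ms Hamming windows with a 10 ms hop.
    y, sr = librosa.load(path, sr=sr)
    n_fft = int(0.025 * sr)
    hop = int(0.010 * sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft,
                                hop_length=hop, n_mels=26, window='hamming')
    delta = librosa.feature.delta(mfcc)             # first-order (delta) coefficients
    delta2 = librosa.feature.delta(mfcc, order=2)   # acceleration coefficients
    return np.vstack([mfcc, delta, delta2])         # shape: (39, number of frames)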
Features are typically summarized over frames by statistical functionals such as moments (mean and skewness), segments (duration, number) or extreme values (max, min, range).
Feature selection: Feature selection chooses, from a larger set of redundant or irrelevant features, a subset of features that describes the phenomenon. It is done to improve the accuracy and performance of the classifier [20]. Wrapper-based selection methods are a commonly used approach; they employ the accuracy of the target classifier as the optimization criterion in a closed-loop fashion [26], and features with poor performance are discarded. Hill climbing and sequential forward search, which starts from an empty set and adds features sequentially, are commonly chosen procedures. The selected features give performance improvements. A second general approach uses filter methods, which ignore the effect of the selected feature subset on the classifier. The difference between the reduced feature sets obtained from acted and non-acted emotions is very small.
There are a number of methods for feature extraction, such as linear predictive cepstral coefficients (LPCC), power spectral analysis (FFT), first-order derivatives (DELTA), linear predictive analysis (LPC), Mel-scale cepstral analysis (MEL), perceptual linear predictive coefficients (PLP) and relative spectral filtering of log-domain coefficients (RASTA).
Linear predictive coding (LPC): LPC is useful for encoding quality speech at a low bit rate and is one of the most powerful techniques of speech analysis. The basic idea behind linear predictive analysis is that a speech sample at the current time can be approximated as a linear combination of past speech samples. It is a model based on human speech production that utilizes a conventional source-filter model: lip radiation, the vocal tract and the glottal transfer functions are integrated into one all-pole filter that simulates the acoustics of the vocal tract. LPC minimizes the sum of squared differences between the original and estimated speech signal over a finite duration, which yields a unique set of predictor coefficients. The actual predictor coefficients are not used directly in recognition, as they show high variance; instead, they are transformed into cepstral coefficients, a more robust set of parameters. Some of the types of LPC are residual excitation, regular pulse excited, pitch excitation, voice excitation and code-excited LPC.
Mel frequency cepstral coefficients (MFCC): This is considered one of the standard methods for feature extraction, and the use of about 20 MFCC coefficients is most common in ASR, although 10-12 coefficients are sufficient for coding speech. The method depends on the spectral form, which makes it more sensitive to noise; this can be mitigated by exploiting the periodicity of the speech signal, although aperiodic content is also present in speech. MFCC represents the real cepstrum of the windowed short-time fast Fourier transform (FFT) of the signal [21] and uses a non-linear frequency scale. The MFCC audio feature extraction technique extracts parameters similar to those the human auditory system uses for hearing speech, while de-emphasizing other information. The speech signal is divided into time frames containing an arbitrary number of samples; in most systems, overlapping from frame to frame is used to smooth the transitions, and a Hamming window is then applied to eliminate the discontinuities at the edges of each time frame.
Mel-frequency cepstral coefficients (MFCCs, [154]) are a parametric
representation of the speech signal, that is commonly used in automatic speech
recognition, but they have proved to be successful for other purposes as well,
among them speaker identification and emotion recognition. MFCCs are
calculated by applying a Mel-scale filter bank to the Fourier transform of a
windowed signal. Subsequently, a DCT (discrete cosine transform) transforms
the logarithmised spectrum into a cepstrum. The MFCCs are then the
amplitudes of the cepstrum. Usually, only the first 12 coefficients are used.
Through the mapping onto the Mel-scale, which is an adaptation of the Hertz-
scale for frequency to the human sense of hearing, MFCCs enable a signal
representation that is closer to human perception. MFCCs filter out pitch and
other influences in speech that are not linguistically relevant, hence they are
very suitable for speech recognition. Though this should make them useless for emotion recognition, they have nevertheless proved beneficial for it in practice.
MEL FREQUENCY CEPSTRUM COEFFICIENT (MFCC) FEATURES
A subset of features used for speech emotion detection is grouped under the category of Mel Frequency Cepstrum Coefficients (MFCC) [16]. It can be explained as follows:
The word Mel represents the scale used in frequency versus pitch measurement. A value measured on the frequency scale (in Hz) can be converted to the Mel scale using the formula m = 2595 log10(1 + f/700).
The word Cepstrum represents the Fourier Transform of the log spectrum of
the speech signal.
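A minimal sketch of the conversion formula above is given below; the example frequency is arbitrary.

import math

def hz_to_mel(f_hz):
    # Mel scale conversion as defined above: m = 2595 * log10(1 + f/700)
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000.0))   # approximately 1000 mel, by construction of the scale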
Perceptual linear prediction (PLP): Hermansky developed the PLP model, which uses the psychophysics of hearing to model human speech. PLP improves the speech recognition rate by discarding irrelevant information. The only thing that makes PLP different from LPC is that its spectral characteristics are transformed to match the human auditory system. The three main perceptual aspects approximated by PLP are the intensity-loudness power-law relation, the equal-loudness curve and the critical-band resolution curves.
Mel scale cepstral analysis (MEL): PLP analysis and MEL analysis are similar to each other in that a psychophysically based spectral transformation is used to modify the spectrum. In MEL the spectrum is warped according to the Mel scale, whereas in PLP it is warped according to the Bark scale. The other main difference between PLP and MEL scale cepstral analysis lies in the output cepstral coefficients: in PLP the modified power spectrum is smoothed with an all-pole model and the output cepstral coefficients are then computed on the basis of this model, whereas in MEL scale cepstral analysis the modified power spectrum is smoothed by cepstral smoothing, and the log power spectrum is transformed directly into the cepstral domain using the discrete Fourier transform (DFT).
Relative spectral filtering (RASTA): The analysis library provides the ability to perform RASTA filtering to compensate for linear channel distortions. It can be used in either the cepstral or the log-spectral domain; in both, linear channel distortions appear as an additive constant. The RASTA filter band-passes each feature coefficient, and the high-pass portion of the equivalent band-pass filter alleviates the effect of convolutional noise introduced in the channel. Frame-to-frame spectral changes are then smoothed by the low-pass portion of the filter.
Power spectral analysis (FFT): This is one of the more common techniques for studying the speech signal; the power spectrum of the speech signal describes the frequency content of the signal over time. The first step in computing the power spectrum is to take the discrete Fourier transform (DFT) of the speech signal, which computes the frequency information equivalent to the time-domain signal. Since the speech signal consists of real-valued samples, the fast Fourier transform (FFT) can be used to increase efficiency.
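The sketch below shows this computation for a single frame with NumPy; the Hamming window and the 16 kHz sampling rate are assumptions for illustration.

import numpy as np

def frame_power_spectrum(frame, sample_rate=16000):
    # Window the frame, take the real-valued FFT and return the power spectrum
    # together with the corresponding frequency axis.
    windowed = frame * np.hamming(len(frame))
    spectrum = np.fft.rfft(windowed)
    power = (np.abs(spectrum) ** 2) / len(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return freqs, power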
CHAPTER 4
MODULE DESIGN
In module 1, the voice that will be processed is given. The user can start speaking after pressing the microphone-like button.
It is important to specify the gender of the speaker, whether male or female, before starting to speak.
In module 2, pre-processing is carried out.
Pre-processing includes silence removal, pre-emphasis, normalization and windowing, so it is an important phase for obtaining a clean signal that is passed to the next stage (feature extraction).
Discrimination between speech and music files is performed based on a comparison of more than one statistical indicator, such as the mean, standard deviation, energy and silence interval.
The speech signal usually includes many silent parts. The silent parts are not important because they do not contain information. There are several methods to remove these parts, such as the zero-crossing rate (ZCR) and short-time energy (STE). The zero-crossing rate is a measure of the number of times, in a given time interval, that the amplitude of the speech signal passes through a value of zero.
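A minimal sketch of frame-level silence removal using these two measures follows; the frame length and the energy and ZCR thresholds are illustrative assumptions and would need tuning on real data.

import numpy as np

def zero_crossing_rate(frame):
    # Fraction of samples at which the signal changes sign within the frame.
    return np.sum(np.abs(np.diff(np.sign(frame)))) / (2.0 * len(frame))

def short_time_energy(frame):
    return np.sum(frame.astype(float) ** 2) / len(frame)

def drop_silent_frames(signal, sample_rate=16000, frame_ms=25,
                       energy_thresh=1e-4, zcr_thresh=0.3):
    # Keep only frames whose energy is above a threshold and whose ZCR is not noise-like.
    frame_len = int(sample_rate * frame_ms / 1000)
    kept = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        if short_time_energy(frame) > energy_thresh and zero_crossing_rate(frame) < zcr_thresh:
            kept.append(frame)
    return np.concatenate(kept) if kept else np.array([], dtype=float)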
(Block diagram: Feature Selection - Classification - Output)
THE FEATURE SET COMPRISES THE FOLLOWING:
Mel frequency cepstral coefficients (MFCC): This is considered one of the standard methods for feature extraction, and the use of about 20 MFCC coefficients is most common in ASR, although 10-12 coefficients are sufficient for coding speech. The method depends on the spectral form, which makes it more sensitive to noise; this can be mitigated by exploiting the periodicity of the speech signal, although aperiodic content is also present in speech. MFCC represents the real cepstrum of the windowed short-time fast Fourier transform (FFT) of the signal [21] and uses a non-linear frequency scale. The MFCC audio feature extraction technique extracts parameters similar to those the human auditory system uses for hearing speech, while de-emphasizing other information. The speech signal is divided into time frames containing an arbitrary number of samples; in most systems, overlapping from frame to frame is used to smooth the transitions, and a Hamming window is then applied to eliminate the discontinuities at the edges of each time frame.
A set of 26 features was selected by a statistical method, and a Multilayer Perceptron, Probabilistic Neural Networks and a Support Vector Machine were used for emotion classification into seven classes: anger, happiness, anxiety/fear, sadness, boredom, disgust and neutral.
Energy and formants were evaluated in order to create a feature set sufficient to discriminate between the seven emotions in acted speech.
This is the last and final module of the system. Here the feature-extracted audio is compared against our locally customized datasets.
We have a large quantity of customized data to make sure that no emotion is easily missed.
After comparing the audio with the customized datasets, the best-matching emotion is found.
The detected emotion is displayed to the user in an easily understandable graphical format.
CHAPTER 5
REQUIREMENT SPECIFICATION
RAM : Minimum 2 GB
Monitor : 15 inch
INTEGRATED DEVELOPMENT
ENVIRONMENT : PYCHARM
Python is a popular, general-purpose programming language. It is used for web development (server-side), software development, mathematics and system scripting.
FEATURES OF PYTHON:
Good to know
Python can be written in a simple text editor. It is possible to
write Python in an Integrated Development Environment, such as
Thonny, Pycharm, Netbeans or Eclipse which are particularly useful
when managing larger collections of Python files.
Python was designed for readability, and has some similarities to the
English language with influence from mathematics.
Python uses new lines to complete a command, as opposed to other
programming languages which often use semicolons or parentheses.
Python relies on indentation, using whitespace, to define scope; such as
the scope of loops, functions and classes. Other programming languages
often use curly-brackets for this purpose.
PYCHARM
FEATURES OF PYCHARM
Coding assistance and analysis, with code completion, syntax and error
highlighting, linter integration, and quick fixes
Project and code navigation: specialized project views, file structure views
and quick jumping between files, classes, methods and usages
Python refactoring: includes rename, extract method, introduce variable,
introduce constant, pull up, push down and others
Support for web frameworks: Django, web2py and Flask [professional
edition only][8]
Integrated Python debugger
Integrated unit testing, with line-by-line code coverage
Google App Engine Python development
Version control integration: unified user interface
for Mercurial, Git, Subversion, Perforce and CVS with change lists and
merge
Support for scientific tools like matplotlib, numpy and scipy.
The next step after data collection was to represent these audio files
numerically, in order to perform further analysis on them. This step is called
feature extraction, where quantitative values for different features of the audio are obtained. The pyAudioAnalysis library was used for this purpose. This Python
library provides functions for short-term feature extraction, with tunable
windowing parameters such as frame size and frame step. At the end of this
step, each audio file was represented as a row in a CSV file with 34 columns
representing the different features. Each feature will have a range of values for
one audio file obtained over the various frames in that audio signal.
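A minimal sketch of this extraction step is shown below. The function names follow current pyAudioAnalysis releases (older versions expose the same functionality as audioFeatureExtraction.stFeatureExtraction), and the file name and the 50 ms / 25 ms windowing values are illustrative assumptions rather than the exact parameters used in this report.

from pyAudioAnalysis import audioBasicIO, ShortTermFeatures
import numpy as np

sampling_rate, signal = audioBasicIO.read_audio_file("speech_sample.wav")   # hypothetical file
signal = audioBasicIO.stereo_to_mono(signal)
features, feature_names = ShortTermFeatures.feature_extraction(
    signal, sampling_rate, 0.050 * sampling_rate, 0.025 * sampling_rate)
print(len(feature_names))                              # 34 base short-term features (68 if deltas are appended)
np.savetxt("features.csv", features.T, delimiter=",")  # one row per frame, one column per feature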
Numpy
Matplotlib
Keras
TensorFlow
Hmmlearn
Simplejson
pydub
NUMPY:
At the core of the NumPy package, is the ndarray object. This encapsulates n-
dimensional arrays of homogeneous data types, with many operations being
performed in compiled code for performance. There are several important
differences between NumPy arrays and the standard Python sequences:
NumPy arrays have a fixed size at creation, unlike Python lists (which can
grow dynamically). Changing the size of an ndarray will create a new
array and delete the original.
The elements in a NumPy array are all required to be of the same data type,
and thus will be the same size in memory. The exception: one can have arrays
of (Python, including NumPy) objects, thereby allowing for arrays of
different sized elements.
NumPy arrays facilitate advanced mathematical and other types of
operations on large numbers of data. Typically, such operations are
executed more efficiently and with less code than is possible using Python's
built-in sequences.
A growing plethora of scientific and mathematical Python-based packages
are using NumPy arrays; though these typically support Python-sequence
input, they convert such input to NumPy arrays prior to processing, and they
often output NumPy arrays.
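A tiny example of these points (fixed size, one data type, vectorised operations); the values are arbitrary:

import numpy as np

a = np.array([1.0, 2.0, 3.0])   # homogeneous dtype (float64), fixed size at creation
b = a * 2.5                     # vectorised arithmetic, no explicit Python loop
print(a.dtype, a.shape, b)      # float64 (3,) [2.5 5.  7.5]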
MATPLOTLIB:
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
Learn more about Matplotlib through the many external learning resources.
Keras:
Keras is a deep learning API written in Python, running on top of the machine
learning platform TensorFlow. It was developed with a focus on enabling fast
experimentation. Being able to go from idea to result as fast as possible is key
to doing good research.
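As a hedged illustration of how Keras could sit on top of the extracted features, the sketch below trains a small dense network to map a fixed-length feature vector (for example, per-clip averages of the 34 pyAudioAnalysis features) to one of four emotion classes. The layer sizes, feature dimension and the randomly generated training data are assumptions for demonstration, not this report's actual model or results.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_FEATURES = 34   # assumed per-clip feature dimension
NUM_CLASSES = 4     # angry, happy, sad, neutral

model = keras.Sequential([
    layers.Input(shape=(NUM_FEATURES,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(32, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy data stands in for the real feature matrix and labels.
X = np.random.rand(200, NUM_FEATURES).astype("float32")
y = np.random.randint(0, NUM_CLASSES, size=200)
model.fit(X, y, epochs=10, batch_size=16, validation_split=0.2)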
CHAPTER 6
IMPLEMENTATION
WORKSPACE.XML
value="true"/><property name="last_opened_file_path"
value="$PROJECT_DIR$/testing.py"/><property
name="settings.editor.selected.configurable"
value="com.jetbrains.python.configuration.PyActiveSdkModuleConfigura
ble"/></component><component name="RecentsManager"><key
name="CopyFile.RECENT_KEYS"><recent name="D:\Projects\Sentiment-
Analysis-master"/></key></component><component name="RunManager"
selected="Python.testing"><configuration name="Speech rec tsmil "
nameIsGenerated="true" temporary="true" factoryName="Python"
type="PythonConfigurationType"><module name="Sentiment-Analysis-
master"/><option name="INTERPRETER_OPTIONS" value=""/><option
name="PARENT_ENVS" value="true"/><envs><env
name="PYTHONUNBUFFERED" value="1"/></envs><option
name="SDK_HOME" value=""/><option
name="WORKING_DIRECTORY"
value="$USER_HOME$/pythonProject"/><option
name="IS_MODULE_SDK" value="false"/><option
name="ADD_CONTENT_ROOTS" value="true"/><option
name="ADD_SOURCE_ROOTS" value="true"/><option
name="SCRIPT_NAME" value="$USER_HOME$/pythonProject/Speech
rec tsmil .py"/><option name="PARAMETERS" value=""/><option
name="SHOW_COMMAND_LINE" value="false"/><option
name="EMULATE_TERMINAL" value="false"/><option
name="MODULE_MODE" value="false"/><option
name="REDIRECT_INPUT" value="false"/><option name="INPUT_FILE"
value=""/><method v="2"/></configuration><configuration name="main -
Copy" nameIsGenerated="true" temporary="true" factoryName="Python"
type="PythonConfigurationType"><module name="Sentiment-Analysis-
master"/><option name="INTERPRETER_OPTIONS" value=""/><option
name="PARENT_ENVS" value="true"/><envs><env
name="PYTHONUNBUFFERED" value="1"/></envs><option
name="SDK_HOME" value=""/><option
name="WORKING_DIRECTORY" value="$PROJECT_DIR$"/><option
name="IS_MODULE_SDK" value="true"/><option
name="ADD_CONTENT_ROOTS" value="true"/><option
name="ADD_SOURCE_ROOTS" value="true"/><option
name="SCRIPT_NAME" value="$PROJECT_DIR$/main -
Copy.py"/><option name="PARAMETERS" value=""/><option
name="SHOW_COMMAND_LINE" value="false"/><option
name="EMULATE_TERMINAL" value="false"/><option
name="MODULE_MODE" value="false"/><option
name="REDIRECT_INPUT" value="false"/><option name="INPUT_FILE"
value=""/><method v="2"/></configuration><configuration name="main"
nameIsGenerated="true" temporary="true" factoryName="Python"
type="PythonConfigurationType"><module name="Sentiment-Analysis-
master"/><option name="INTERPRETER_OPTIONS" value=""/><option
name="PARENT_ENVS" value="true"/><envs><env
name="PYTHONUNBUFFERED" value="1"/></envs><option
name="SDK_HOME"
value="C:\Users\thiru\AppData\Local\Microsoft\WindowsApps\python3.7.
exe"/><option name="WORKING_DIRECTORY"
value="$PROJECT_DIR$"/><option name="IS_MODULE_SDK"
value="false"/><option name="ADD_CONTENT_ROOTS"
value="true"/><option name="ADD_SOURCE_ROOTS"
value="true"/><option name="SCRIPT_NAME"
value="$PROJECT_DIR$/main.py"/><option name="PARAMETERS"
value=""/><option name="SHOW_COMMAND_LINE"
value="false"/><option name="EMULATE_TERMINAL"
value="false"/><option name="MODULE_MODE" value="false"/><option
name="REDIRECT_INPUT" value="false"/><option name="INPUT_FILE"
value=""/><method v="2"/></configuration><configuration name="testing"
nameIsGenerated="true" temporary="true" factoryName="Python"
type="PythonConfigurationType"><module name="Sentiment-Analysis-
master"/><option name="INTERPRETER_OPTIONS" value=""/><option
name="PARENT_ENVS" value="true"/><envs><env
name="PYTHONUNBUFFERED" value="1"/></envs><option
name="SDK_HOME" value=""/><option
name="WORKING_DIRECTORY" value="$PROJECT_DIR$"/><option
name="IS_MODULE_SDK" value="true"/><option
name="ADD_CONTENT_ROOTS" value="true"/><option
name="ADD_SOURCE_ROOTS" value="true"/><option
name="SCRIPT_NAME" value="$PROJECT_DIR$/testing.py"/><option
name="PARAMETERS" value=""/><option
name="SHOW_COMMAND_LINE" value="false"/><option
name="EMULATE_TERMINAL" value="false"/><option
name="MODULE_MODE" value="false"/><option
name="REDIRECT_INPUT" value="false"/><option name="INPUT_FILE"
value=""/><method v="2"/></configuration><configuration name="train"
nameIsGenerated="true" temporary="true" factoryName="Python"
type="PythonConfigurationType"><module name="Sentiment-Analysis-
master"/><option name="INTERPRETER_OPTIONS" value=""/><option
name="PARENT_ENVS" value="true"/><envs><env
name="PYTHONUNBUFFERED" value="1"/></envs><option
name="SDK_HOME" value=""/><option
name="WORKING_DIRECTORY" value="$PROJECT_DIR$/../Face-
Recognition-Based-Attendance-System-master"/><option
name="IS_MODULE_SDK" value="false"/><option
name="ADD_CONTENT_ROOTS" value="true"/><option
name="ADD_SOURCE_ROOTS" value="true"/><option
name="SCRIPT_NAME" value="$PROJECT_DIR$/../Face-Recognition-
Based-Attendance-System-master/train.py"/><option
name="PARAMETERS" value=""/><option
name="SHOW_COMMAND_LINE" value="false"/><option
name="EMULATE_TERMINAL" value="false"/><option
name="MODULE_MODE" value="false"/><option
name="REDIRECT_INPUT" value="false"/><option name="INPUT_FILE"
value=""/><method v="2"/></configuration><recent_temporary><list><item
itemvalue="Python.testing"/><item itemvalue="Python.Speech rec tsmil
"/><item itemvalue="Python.main - Copy"/><item
itemvalue="Python.main"/><item
itemvalue="Python.train"/></list></recent_temporary></component><compo
nent name="SpellCheckerSettings" transferred="true"
UseSingleDictionary="true" DefaultDictionary="application-level"
CustomDictionaries="0" Folders="0" RuntimeDictionaries="0"/><component
name="TaskManager"><task id="Default" summary="Default task"
active="true"><changelist name="Default Changelist" comment=""
id="e26abddc-b29c-45a3-99d2-
743fbf23056f"/><created>1612463823446</created><option name="number"
value="Default"/><option name="presentableId"
value="Default"/><updated>1612463823446</updated></task><servers/></co
mponent></project>.
MODULES.XML
<project version="4"><component
name="ProjectModuleManager"><modules><module
filepath="$PROJECT_DIR$/.idea/Sentiment-Analysis-master.iml"
fileurl="file://$PROJECT_DIR$/.idea/Sentiment-Analysis-master.iml"/>
</modules>
</component>
</project>
MISC.XML
PROFILE.XML
<?xml version="1.0"?>
<component name="InspectionProjectProfileManager"><settings><option
name="USE_PROJECT_PROFILE" value="false"/><version
value="1.0"/></settings></component>
PYTHON CODE:
PROJECT.PY
from tkinter import *
from tkinter import messagebox
import string
from collections import Counter
import matplotlib.pyplot as plt
import speech_recognition as sr

tkWindow = Tk()
tkWindow.geometry('400x150')
tkWindow.title('SPEECH RECOGNITION')
var = StringVar()

def showMsg():
    r = sr.Recognizer()
    text = ''
    with sr.Microphone() as source:   # microphone context restored; r.listen() needs a source
        audio = r.listen(source)
    try:
        text = r.recognize_google(audio)   # Google Speech-to-Text API
    except:
        # the original error-handling lines were cut off at a page break
        messagebox.showerror('Error', 'Could not recognize the speech')
        return
    # text = open("read1.txt", encoding="utf-8").read()
    # converting to lowercase
    lower_case = text.lower()
    # Removing punctuations
    cleaned_text = lower_case.translate(str.maketrans('', '', string.punctuation))
    tokenized_words = cleaned_text.split()
    stop_words = ["i", "me", "my", "myself", "we", "our", "you", "your", "this", "that",
"these",
"do",
"does", "did", "doing", "a", "an", "the", "and", "but", "if", "or",
"because", "as", "until", "while",
"after", "above", "below", "to", "from", "up", "down", "in", "out",
"on", "off", "over", "under",
"again",
"each",
"than",
"too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
    final_words = []
    for word in tokenized_words:
        if word not in stop_words:
            final_words.append(word)
    # look each remaining word up in the word:emotion lexicon
    emotion_list = []
    with open('emotions.txt', 'r') as file:   # lexicon file name assumed (see section 6.2)
        for line in file:
            clear_line = line.replace("\n", '').replace(",", '').replace("'", '').strip()
            word, emotion = clear_line.split(':')
            if word in final_words:
                emotion_list.append(emotion)
    labeltext = "You Said :" + text
    var.set(labeltext)
    label = Label(tkWindow, textvariable=var)
    label.pack()
    print(emotion_list)
    w = Counter(emotion_list)
    print(w)
    fig, ax1 = plt.subplots()
    ax1.bar(w.keys(), w.values())
    fig.autofmt_xdate()
    plt.savefig('graph.png')
    plt.show()
button = Button(tkWindow,
text='Speak',
command=showMsg)
button.pack()
tkWindow.mainloop()
MAIN.PY
import tkinter as tk
from tkinter import *
from tkinter import messagebox
from collections import Counter
import matplotlib.pyplot as plt
import string
import speech_recognition as sr
tkWindow = Tk()
tkWindow.geometry('500x450')
tkWindow.title('SPEECH RECOGNITION')
tkWindow.configure(bg='blue')
var = StringVar()
def speak():
tkWindow1 = Toplevel()
tkWindow1.geometry('400x150')
var2 = StringVar()
photo = PhotoImage(file=r"mic.png")
photoimage = photo.subsample(6, 6)
button = Button(tkWindow1,
text='Speak',
image=photoimage,
command=showMsg).pack(side = TOP)
tkWindow1.mainloop()
def gen():
tkWindow2 = Toplevel()
tkWindow2.geometry('400x150')
    var1 = StringVar()
    var1.set('SELECT YOUR GENDER')              # prompt text assumed; original line was cut off
    label1 = Label(tkWindow2, textvariable=var1)
    label1.pack()
button = Button(tkWindow2,
text='MALE',
command=speak).pack(side=TOP)
button = Button(tkWindow2,
text='FEMALE',
command=speak).pack(side=TOP)
tkWindow.mainloop()
def showMsg():
    r = sr.Recognizer()
    text = ''
    with sr.Microphone() as source:   # microphone context restored; r.listen() needs a source
        audio = r.listen(source)
    try:
        text = r.recognize_google(audio)
    except:
        # the original error-handling lines were cut off at a page break
        messagebox.showerror('Error', 'Could not recognize the speech')
        return
    # converting to lowercase
    lower_case = text.lower()
    # Removing punctuations
    cleaned_text = lower_case.translate(str.maketrans('', '', string.punctuation))
    tokenized_words = cleaned_text.split()
    stop_words = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
"yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it",
"its", "itself",
"they", "them", "their", "theirs", "themselves", "what", "which", "who",
"whom", "this", "that", "these",
"those", "am", "is", "are", "was", "were", "be", "been", "being",
"have", "has", "had", "having",
"do",
"does", "did", "doing", "a", "an", "the", "and", "but", "if", "or",
"because", "as", "until", "while",
"of", "at", "by", "for", "with", "about", "against", "between", "into",
"through", "during", "before",
"after", "above", "below", "to", "from", "up", "down", "in", "out",
"on", "off", "over", "under",
"again",
"further", "then", "once", "here", "there", "when", "where", "why",
"how", "all", "any", "both",
"each",
"few", "more", "most", "other", "some", "such", "no", "nor", "not",
"only", "own", "same", "so",
"than",
"too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
    final_words = []
    for word in tokenized_words:
        if word not in stop_words:
            final_words.append(word)
    emotion_list = []
    with open('emotions.txt', 'r') as file:   # lexicon file name assumed (see section 6.2)
        for line in file:
            clear_line = line.replace("\n", '').replace(",", '').replace("'", '').strip()
            word, emotion = clear_line.split(':')
            if word in final_words:
                emotion_list.append(emotion)
    labeltext = "You Said :" + text
    var.set(labeltext)
    label = Label(tkWindow, textvariable=var)
    label.pack()
#print(emotion_list)
w = Counter(emotion_list)
#print(w)
    fig, ax1 = plt.subplots()
    ax1.bar(w.keys(), w.values())
fig.autofmt_xdate()
plt.savefig('graph.png')
plt.show()
label = Label(tkWindow, textvariable=var)
label.pack()
button = Button(tkWindow,
                text='START',                  # button caption assumed; original line was cut off
                command=gen)
button.pack()
tkWindow.mainloop()
MAIN.NLKTR.PY
import string
from collections import Counter
import matplotlib.pyplot as plt
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# restored lines follow the NLTK-based flow implied by the file name; source file name assumed
text = open('read.txt', encoding='utf-8').read()
lower_case = text.lower()
cleaned_text = lower_case.translate(str.maketrans('', '', string.punctuation))
tokenized_words = word_tokenize(cleaned_text, "english")

final_words = []
for word in tokenized_words:
    if word not in stopwords.words('english'):
        final_words.append(word)

lemma_words = []
for word in final_words:
    word = WordNetLemmatizer().lemmatize(word)
    lemma_words.append(word)

emotion_list = []
with open('emotions.txt', 'r') as file:   # lexicon file name assumed (see section 6.2)
    for line in file:
        clear_line = line.replace("\n", '').replace(",", '').replace("'", '').strip()
        word, emotion = clear_line.split(':')
        if word in lemma_words:
            emotion_list.append(emotion)

print(emotion_list)
w = Counter(emotion_list)
print(w)

def sentiment_analyse(sentiment_text):
    score = SentimentIntensityAnalyzer().polarity_scores(sentiment_text)
    if score['neg'] > score['pos']:
        print("Negative Sentiment")
    elif score['neg'] < score['pos']:
        print("Positive Sentiment")
    else:
        print("Neutral Sentiment")

sentiment_analyse(cleaned_text)
fig, ax1 = plt.subplots()
ax1.bar(w.keys(), w.values())
fig.autofmt_xdate()
plt.savefig('graph.png')
plt.show()
SPEECH ANALYS.PY
import string
from collections import Counter
import matplotlib.pyplot as plt
import GetOldTweets3 as got   # import assumed; the 'got' alias is used below

def get_tweets():
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch('Dhoni') \
        .setSince("2020-01-01") \
        .setUntil("2020-04-01") \
        .setMaxTweets(1000)
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    # Creating list of chosen tweet data
    text_tweets = [[tweet.text] for tweet in tweets]
    return text_tweets

text = ""
text_tweets = get_tweets()
length = len(text_tweets)
for i in range(0, length):
    text = text_tweets[i][0] + " " + text

# converting to lowercase
lower_case = text.lower()
# Removing punctuations
cleaned_text = lower_case.translate(str.maketrans('', '', string.punctuation))
tokenized_words = cleaned_text.split()
stop_words = ["i", "me", "my", "myself", "we", "our", "you", "your", "this", "that",
"those", "am", "is", "are", "was", "were", "be", "been", "being", "have",
"has", "had", "having", "do",
"does", "did", "doing", "a", "an", "the", "and", "but", "if", "or",
"because", "as", "until", "while",
"after", "above", "below", "to", "from", "up", "down", "in", "out", "on",
"off", "over", "under", "again",
"too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
final_words = []
for word in tokenized_words:
    if word not in stop_words:
        final_words.append(word)
emotion_list = []
with open('emotions.txt', 'r') as file:   # lexicon file name assumed (see section 6.2)
    for line in file:
        clear_line = line.replace("\n", '').replace(",", '').replace("'", '').strip()
        word, emotion = clear_line.split(':')
        if word in final_words:
            emotion_list.append(emotion)
w = Counter(emotion_list)
print(w)
fig, ax1 = plt.subplots()
ax1.bar(w.keys(), w.values())
fig.autofmt_xdate()
plt.savefig('graph.png')
plt.show()
6.2 DATASETS
The size of the dataset is large enough for the model to be trained effectively.
The more data a model is exposed to, the better it performs.
The audio files are mono signals, which ensures an error-free conversion with
most of the programming libraries.
Dataset We evaluate our model using the Interactive Emotional Dyadic Motion
Capture (IEMOCAP) dataset. This dataset was collected following theatrical
theory in order to simulate natural dyadic interactions between actors. We use
categorical evaluations with majority agreement. We use only four emotional
categories (happy, sad, angry, and neutral) to compare the performance of our
model with other research using the same categories. The IEMOCAP dataset
includes five sessions, and each session contains utterances from two speakers
(one male and one female). This data collection process resulted in 10 unique
speakers. For consistent comparison with previous work, we merge the
excitement dataset with the happiness dataset. The final dataset contains a total of 5531 utterances (1636 happy, 1084 sad, 1103 angry, 1708 neutral).
Coding procedure
We developed our own software for the coding of the emotions to take
advantage of the precise timings of the word onsets that our transcription
offered. The program, written using MATLAB, allows the coder to watch the
video recording of the couple while listening to the session, at the same time
viewing the text transcript for each participant. The coder determines an
emotion category and an intensity level (low, medium, high) of that emotion.
(In the analysis reported in this paper, we did not differentiate between the
intensity levels.) A coder estimates the time, t0, at which an emotion begins,
and the time, t1, at which an emotion ends. Although data were recorded every
millisecond, we did not expect the accuracy of t0 or t1, to be at this level. The
association of a word with an emotion code from the set {Anger, Sadness, Joy, Tension, Neutral} proceeds as follows. If at a time tn a coding is set for emotion Ci and at time
tn+1 a coding is set for emotion Cj different from Ci, then any word with an
onset in the interval [tn,tn+1] is automatically coded as Ci, and any word with
an onset immediately after tn+1 is coded as Cj. We do not allow two emotions
to overlap and every word occurrence (or token) is coded with one and only one
emotion or Neutral. In the rest of this paper we talk about emotion-coded word
tokens or just emotion-coded tokens. They refer to the segments of the acoustic
signal associated with the word tokens and labeled with one of the four
emotions or Neutral. Transformations of these segments are the observations
that are used in the machine-learning classification model. It is well recognized
by most investigators that it is very expensive and time consuming to have the
coding of the temporal length of emotion as an individual human coder’s
responsibility. The need for automated programming to do such coding is
essential in the future to reduce cost.
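A small sketch of the interval rule described above is given below, written in Python for consistency with the rest of this report rather than in the authors' MATLAB tool; the data structures and example times are purely illustrative.

def label_tokens(word_onsets, emotion_spans):
    # word_onsets: list of (word, onset_time); emotion_spans: list of (t0, t1, emotion),
    # sorted and non-overlapping. A word whose onset falls in [t0, t1) receives that
    # emotion; any other word is coded Neutral.
    labels = []
    for word, onset in word_onsets:
        code = "Neutral"
        for t0, t1, emotion in emotion_spans:
            if t0 <= onset < t1:
                code = emotion
                break
        labels.append((word, code))
    return labels

spans = [(0.0, 1.2, "Anger"), (1.2, 2.0, "Joy")]                       # coded spans (illustrative)
words = [("you", 0.3), ("never", 0.9), ("listen", 1.4), ("okay", 2.5)]
print(label_tokens(words, spans))
# [('you', 'Anger'), ('never', 'Anger'), ('listen', 'Joy'), ('okay', 'Neutral')]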
In the field of affect detection, a very important role is played by the suitable choice of a speech database. Three kinds of databases are used to build a good emotion recognition system, as given below [8]:
2. Actor-based speech database: This type of speech dataset is collected from trained and professional artists.
Advantage: A wide variety of emotions is present in this database, and it is also very easy to collect.
3. Natural speech database: Real world data is used to create this database.
Advantage: For real world emotion recognition use of natural speech database
is very useful.
Problem: It consists of background noise, and all emotions may not be present in it.
EMOTION DATASETS
'bored': 'bored', 'brave': 'fearless', 'bright': 'happy', 'brisk': 'happy', 'calm': 'safe',
'disconsolate': 'sad', 'discontented': 'sad', 'discounted': 'belittled',
'frustrated': 'angry', 'full of anticipation': 'attracted', 'full of ennui': 'apathetic',
'in a huff': 'angry', 'in a stew': 'angry', 'in control': 'adequate', 'in fear': 'fearful',
'in pain': 'sad', 'in the dumps': 'sad', 'in the zone': 'focused', 'incensed': 'angry',
'jocular': 'happy', 'jolly': 'happy', 'jovial': 'happy', 'joyful': 'happy', 'joyless': 'sad',
'joyous': 'happy', 'jubilant': 'happy', 'justified': 'singled out', 'keen': 'attracted',
'labeled': 'singled out', 'lackadaisical': 'bored', 'lazy': 'apathetic', 'left out': 'hated',
'lonesome': 'alone', 'lost': 'lost', 'loved': 'attached', 'low': 'sad', 'lucky': 'happy',
'reassured': 'fearless', 'reckless': 'powerless', 'redeemed': 'singled out',
'trapped': 'entitled', 'tremulous': 'fearful', 'turned on': 'lustful', And much more.
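The entries above come from the word-to-emotion lexicon that the classifier module compares the recognized text against. A minimal sketch of loading and querying such a mapping is shown below; the file name and the one-pair-per-line format are assumptions based on the listing style above.

emotion_map = {}
with open('emotions.txt', 'r') as lexicon:
    for line in lexicon:
        clear_line = line.replace("\n", '').replace(",", '').replace("'", '').strip()
        if ':' in clear_line:
            word, emotion = clear_line.split(':', 1)
            emotion_map[word.strip()] = emotion.strip()

print(emotion_map.get('jovial'))   # -> 'happy' if the entry shown above is present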
6.3 SAMPLE SCREEN SHOTS
Fig 6.2 Voice Captured
Fig 6.4 Voice Analyzed
CHAPTER 7
TESTING AND MAINTENANCE
7.1 TESTING
Implementation forms an important phase in the system development
life cycle. It is a stage of the project work that transforms the design into a
model. Testing was done to see if all the features provided in the modules are
performing satisfactorily and to ensure that the process of testing is as realistic as
possible.
Each program was tested individually at the time of development using sample data, and it was verified that the programs link together in the way specified in the program specification. The computer system and its environment were tested to the satisfaction of the user. The system that has been developed has been accepted and proved to be satisfactory, and so it is going to be implemented very soon.
Initially as a first step the executable form of the application is to be created and
loaded in the common server machine which is accessible to all the users and
the server is to be connected to a network. The final stage is to document the
entire system which provides components and the operating procedures of the
system.
The importance of software testing and its implementations with respect to
software quality cannot be overemphasized. Software testing is a critical
element of software quality assurance and represents the ultimate review of
specification, design and coding. Any product can be tested using either a black
box testing or white box testing. Further testing can be implemented along the
lines of code, integration and system testing.
Fig 7.1 Levels of Testing
7.2 TEST CASES

TC_01: Speaking after clicking the mic button
    Input: Voice
    Expected result: The voice must get recorded.
    Actual result: The voice gets recorded.
    Status: Pass

TC_02: Choosing the gender
    Input: Gender
    Expected result: The specified gender must be chosen.
    Actual result: The gender is specified correctly.
    Status: Pass

TC_03: Feature extraction stage
    Input: Voice
    Expected result: Features must be extracted from the voice.
    Actual result: Features are extracted.
    Status: Pass

TC_04: Comparing with the datasets
    Input: Text
    Expected result: The text must match an entry in the data.
    Actual result: The text got matched against the data.
    Status: Pass

TC_05: Showing the result
    Input: Text
    Expected result: A graphical result of the emotion will be displayed.
    Actual result: The correct emotion is displayed in graph form.
    Status: Pass
7.3.1 UNIT TESTING
Functional test cases involved exercising the code with nominal input
values for which the expected results are known, as well as boundary values and
special values, such as logically related inputs, files of identical elements, and
empty files.
Performance Test
Stress Test
Structure Test
Performance tests determine the amount of execution time spent in various parts of the unit, the program throughput, the response time and the device utilization of the program unit.
Stress tests are tests designed to intentionally break the unit. A great deal can be learned about the strengths and limitations of a program by examining the manner in which a program unit breaks.
Structure tests are concerned with exercising the internal logic of a program and traversing particular execution paths. A white-box test strategy was employed to ensure that the test cases guarantee that all independent paths within a module have been exercised at least once.
Exercise all logical decisions on their true or false sides.
Execute all loops at their boundaries and within their operational
bounds.
Exercise internal data structures to assure their validity.
Checking attributes for their correctness.
Handling end of file condition, I/O errors, buffer problems and
textual errors in output information.
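As an illustration of exercising a decision on both its true and false sides with nominal, boundary and special inputs, the sketch below unit-tests a hypothetical stop-word filtering helper; the function and the cases are assumptions for demonstration, not part of the delivered code.

import unittest

def remove_stop_words(words, stop_words):
    # hypothetical helper mirroring the filtering step in the implementation chapter
    return [w for w in words if w not in stop_words]

class RemoveStopWordsTest(unittest.TestCase):
    def test_nominal_input(self):
        self.assertEqual(remove_stop_words(["i", "am", "happy"], {"i", "am"}), ["happy"])

    def test_empty_input(self):
        # boundary case: an empty token list must not raise and must stay empty
        self.assertEqual(remove_stop_words([], {"i"}), [])

    def test_all_words_filtered(self):
        # special case: every token is a stop word
        self.assertEqual(remove_stop_words(["the", "a"], {"the", "a"}), [])

if __name__ == "__main__":
    unittest.main()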
This testing is also called glass-box testing. In this testing, by knowing the specific functions that a product has been designed to perform, tests can be conducted that demonstrate each function is fully operational while at the same time searching for errors in each function. It is a test-case design method that uses the control structure of the procedural design to derive test cases. Basis path testing is a form of white-box testing.
Basis path testing involves:
Flow graph notation
Cyclomatic complexity
Deriving test cases
Graph matrices
Control structure testing
A testing strategy provides a framework into which we can place specific test-case design methods. A testing strategy should have the following characteristics:
Testing begins at the module level and works “outward” toward the
integration of the entire computer based system.
Different testing techniques are appropriate at different points in
time.
The developer of the software and an independent test group
conducts testing.
Testing and Debugging are different activities but debugging must
be accommodated in any testing strategy.
examine the output. Condition testing exercises the logical conditions contained in a module. The possible types of elements in a condition include a Boolean operator, a Boolean variable, a pair of Boolean parentheses, a relational operator or an arithmetic expression. The condition testing method focuses on testing each condition in the program; the purpose of condition testing is to detect not only errors in the conditions of a program but also other errors in the program.
method for resolving deficiencies. Thus the proposed system under consideration has been tested using validation testing and found to be working satisfactorily. Though there were deficiencies in the system, they were not catastrophic.
7.5 MAINTENANCE
After a software system has been verified, tested and implemented, it
must continue to be maintained. Maintenance routines will vary depending on
the type and complexity of the technology. Many software systems will come
with a maintenance schedule or program recommended by the developer.
Maintenance could be provided by the developer as part of the purchase
agreement for the technology.
Ongoing monitoring or testing systems may be installed to ensure that
maintenance needs are identified and met where necessary. Where systems are
in long-term use, a system can be designed to monitor feedback from users and
conduct any modifications or maintenance as needed. Where modifications to
software are made as a result of system maintenance or upgrades, it may be
necessary to instigate further rounds of system verification and testing to ensure
that standards are still met by the modified system.
CHAPTER 8
CONCLUSION AND FUTURE ENHANCEMENT
Future Enhancements:
REFERENCES :
[1] M. E. Ayadi, M. S. Kamel, F. Karray, ―Survey on Speech Emotion
Recognition: Features, Classification Schemes, and Databases‖, Pattern
Recognition, vol. 44, pp. 572-587, 2011.
[2] S. K. Bhakre, A. Bang, ―Emotion Recognition on The Basis of Audio
Signal Using Naive Bayes Classifier‖, 2016 Intl. Conference on Advances in
Computing, Communications and Informatics (ICACCI), pp. 2363- 2367, 2016.
[3] I. Chiriacescu, ―Automatic Emotion Analysis Based On Speech‖, M.Sc.
THESIS Delft University of Technology, 2009.
[4] X. Chen, W. Han, H. Ruan, J. Liu, H. Li, D. Jiang, ―Sequence-to-sequence
Modelling for Categorical Speech Emotion Recognition Using Recurrent Neural
Network‖, 2018 First Asian Conference on Affective Computing and Intelligent
Interaction (ACII Asia), pp. 1-6, 2018.
[5] P. Cunningham, J. Loughrey, ―Overfitting in Wrapper-Based Feature Subset Selection: The Harder You Try the Worse it Gets‖, Research and Development in Intelligent Systems XXI, pp. 33-43, 2005.
[6] C. O. Dumitru, I. Gavat, ―A Comparative Study of Feature Extraction
Methods Applied to Continuous Speech Recognition in Romanian Language‖,
International Symphosium ELMAR, Zadar, Croatia, 2006.
[7] S. Emerich, E. Lupu, A. Apatean, ―Emotions Recognitions by Speech and
Facial Expressions Analysis‖, 17th European Signal Processing Conference,
2009.
[8] R. Elbarougy, M. Akagi, ―Cross-lingual speech emotion recognition
system based on a three-layer model for human perception‖, 2013 AsiaPacific
Signal and Information Processing Association Annual Summit and
Conference, pp. 1–10, 2013.
[9] D. J. France, R. G. Shiavi, ―Acoustical properties of speech as indicators of
depression and suicidal risk‖, IEEE Transactions on Biomedical Engineering,
pp. 829–837, 2000.
[10] P. Harár, R. Burget, M. K. Dutta, ―Speech Emotion Recognition with
Deep Learning‖, 2017 4th International Conference on Signal Processing and
Integrated Networks (SPIN), pp. 137-140, 2017.
[11] Q. Jin, C. Li, S. Chen, ―Speech emotion recognition with acoustic and
lexical features‖, PhD Proposal, pp. 4749–4753, 2015.
[12] Y. Kumar, N. Singh, ―An Automatic Spontaneous Live Speech
Recognition System for Punjabi Language Corpus‖, I J C T A, pp. 259-266,
2016.
[13] Y. Kumar, N. Singh, ―A First Step towards an Automatic Spontaneous
Speech Recognition System for Punjabi Language‖, International Journal of
Statistics and Reliability Engineering, pp. 81-93, 2015.
[14] Y. Kumar, N. Singh, ―An automatic speech recognition system for
spontaneous Punjabi speech corpus‖, International Journal of Speech
Technology, pp. 1-9, 2017.
[15] A. Khan, U. Kumar Roy, ―Emotion Recognition Using Prosodic and
Spectral Features of Speech and Naïve Bayes Classifier‖, 2017 International
Conference on Wireless Communications, Signal Processing and Networking
(WiSPNET), pp. 1017-1021, 2017.
[16] A. Kumar, K. Mahapatra, B. Kabi, A. Routray, ―A novel approach of
Speech Emotion Recognition with prosody, quality and derived features using
SVM classifier for a class of North-Eastern Languages‖, 2015 IEEE 2nd
International Conference on Recent Trends in Information Systems (ReTIS), pp.
372-377, 2015.
[17] Y. Kumar, N. Singh, ―Automatic Spontaneous Speech Recognition for
Punjabi Language Interview Speech Corpus‖, I.J. Education and Management
Engineering, pp. 64-73, 2016.
[18] G. Liu, W. He, B. Jin, ―Feature fusion of speech emotion recognition
based on deep Learning‖, 2018 International Conference on Network
Infrastructure and Digital Content (IC-NIDC), pp. 193-197, 2018.
[19] C. M. Lee, S. S. Narayanan, ―Toward detecting emotions in spoken
dialogs‖, IEEE Transactions on Speech and Audio Processing, pp. 293-303,
2005.
[20] S. Mirsamadi, E. Barsoum, C. Zhang, ―Automatic speech emotion
recognition using recurrent neural networks with local attention‖, 2017 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pp. 2227-2231, 2017.
[21] A. Nogueiras, A. Moreno, A. Bonafonte, J. B. Marino, ―Speech Emotion
Recognition Using Hidden Markov Model‖, Eurospeech, 2001.
[22] J. Pohjalainen, P. Alku, ―Multi-scale modulation filtering in automatic
detection of emotions in telephone speech‖, International Conference on
Acoustic, Speech and Signal Processing, pp. 980- 984, 2014.
[23] S. Renjith, K. G. Manju, ―Speech Based Emotion Recognition in Tamil
and Telugu using LPCC and Hurst Parameters‖, 2017 International Conference
on circuits Power and Computing Technologies (ICCPCT), pp. 1-6, 2017.
Kernel References
https://github.com/marcogdepinto/emotion-classification-from-audio-
files?fbclid=IwAR2T4hhtWWfKdU4FwLS8LOAnF5sBwnmfc6PQH
TGidzLaLl1uUVOvicx7TVw
https://data-flair.training/blogs/python-mini-project-speech-emotion-
recognition/
APPENDIX
(PUBLICATION DETAILS)