
Indian Institute of Technology,
Kharagpur

Industrial Training
(CH48013)

Summer Internship Report

Name: Katra Khushal Jagdish
Roll No: 19CH30039
Guided By: Dr. Neha Soni
Index

S No.  Content
1      Introduction
2      Problem Statement
3      What is Automatic Speech Recognition
4      Signal Analysis
5      Language Modeling
6      Traditional and State-of-the-art ASR
7      Indian Accent Speech Recognition
8      Design Philosophy of Architectures
9      Comparison between Custom Models Implemented
10     Comparison between Our Model and the Original DeepSpeech Model
11     Conclusion

Introduction
Languify AI is a B2B SaaS company that helps businesses track learning efficacy and in-class engagement through an in-house conversational AI, enabling companies to save substantially on operational time and costs. Languify strengthens edtech businesses by providing AI-powered solutions that broaden learning opportunities and by equipping edtech companies to uplift educators, academics, and businesses. A major part of Languify's operation is building chatbots that help educators hone their communication skills, and because the majority of its sales are in India, it is important to have a speech recognition system that works well for Indian English speakers, given the country's linguistic diversity. With over 700 languages spoken in India, there is bound to be great variation in the accents with which English is spoken. Hence, this project introduces a speech recognition model for Indian-accented English so that users of these chatbots feel comfortable and at ease.

Problem Statement
Develop an Automatic Speech Recognition (ASR) system tailored for Indian English speakers and determine the effect of Indian accents on the model.

What is ASR?
Automatic Speech Recognition (ASR) refers to the technology or process that allows machines
to convert spoken language into written text. This capability has a wide range of applications
including:

1. Voice assistants: Devices or applications like Amazon's Alexa, Google Assistant, Apple's
Siri, and Microsoft's Cortana make use of ASR to process and understand voice
commands from users.
2. Transcription services: ASR can be used to convert spoken content, such as lectures or
interviews, into written transcripts.
3. Voice-controlled applications and devices: This includes everything from
voice-controlled light switches to software applications that can be navigated using voice.
4. Accessibility tools: ASR systems are often employed to assist individuals with
disabilities, for instance, converting spoken words into written text for those who have
difficulty writing or typing.
5. Call centers: Some modern call centers use ASR to transcribe or analyze the content of
phone calls.
6. Language learning: ASR can be used in applications to help language learners with
pronunciation and fluency.

The development and accuracy of ASR systems depend on various factors including the
complexity of the language, the clarity of the spoken words, background noise, and the speaker's
accent. Machine learning, deep learning (especially recurrent neural networks and transformers),
and vast amounts of training data have been pivotal in recent advancements in ASR technology.

Challenges Faced in ASR:


1. Variability of Volume:
○ Challenge: The loudness with which people speak can vary widely, even within a
single sentence. A speaker might start a sentence loudly and end it in a whisper, or
vice versa.
○ Impact: ASR systems might miss quieter segments of speech or be overwhelmed
by louder ones, leading to inaccuracies in transcription or recognition.
2. Variability of Speaking Speed:
○ Challenge: Different speakers have different rates of speech. Some people
naturally speak faster, while others speak slowly. Additionally, a single speaker
might vary their speed based on their emotional state, the topic of conversation, or
other factors.
○ Impact: Faster spoken words can sometimes be "mushed" together, making it
hard for the ASR system to distinguish individual words. On the other hand,
slower speech can sometimes insert gaps that the ASR might mistakenly interpret
as word or sentence breaks.
3. Variability of Speaker:
○ Challenge: Everyone's voice is unique. The way one person pronounces a word
can be significantly different from how another person pronounces it. This
variability can be due to factors like age, gender, health, regional accents, and
native language.
○ Impact: ASR systems trained predominantly on one group of speakers might
struggle to recognize speech from a different group. For example, a system
trained mostly on young adult voices might have difficulty understanding elderly
speakers.
4. Variability of Pitch:
○ Challenge: Pitch can vary due to the speaker's mood, the question's intonation, or
natural fluctuations in a person's voice.
○ Impact: Rapid changes in pitch can confuse an ASR system, especially if the
pitch moves outside the range the system was trained on. This can result in
misrecognition of words or phrases.
5. Noise:
○ Challenge: Real-world environments are often noisy. Background sounds—like
traffic, other people talking, music, or machinery—can interfere with the clear
reception of the spoken words.
○ Impact: Noise can mask or distort parts of the speech signal, making it difficult
for the ASR system to correctly identify words. Even in cases where the system
can distinguish speech from noise, the transcription's accuracy might be reduced.

Models in Speech Recognition:


● Acoustic Model (AM): It maps the relationship between audio signals and the phonetic
units (like phonemes or senones) of a language. In essence, the acoustic model predicts
which phonetic units make up a given audio segment. It represents sounds and their
transitions. It often captures the nuances of speech, like pitch, tone, and speed.
● Language Model (LM): It estimates the likelihood or probability of a sequence of words
occurring in the language. The LM provides context to ensure that the sequence of words
produced by the ASR system is linguistically valid and probable based on historical
patterns. It represents patterns in word usage, grammar, and can also capture semantics to
some extent.

Signal Analysis:

When we talk, we produce wave-like patterns in the air. Sounds with higher pitches have quicker
and more frequent vibrations than those with lower pitches. A microphone converts these sound
vibrations into electrical signals. If we pronounce "Hello World", the resulting signal would have
two distinct patterns. Our speech consists of multiple frequencies occurring simultaneously,
essentially being a combination of all these frequencies. To study the signal, we identify its
individual frequencies as key features. The Fourier transform allows us to dissect the signal into
these specific frequencies. This method helps in transforming the sound into a Spectrogram,
where we map frequency on a vertical scale against its occurrence over time. The depth of the
color in the spectrogram denotes the strength of the signal.

To generate a Spectrogram:

1. Segment the signal based on specific time intervals.
2. Use FFT (Fast Fourier Transform) to break down each segment into its respective frequency elements.
3. Represent each segmented time frame using a vector, indicating the amplitude for each frequency.
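
To make these steps concrete, here is a minimal Python sketch that computes a spectrogram with SciPy. The file name and the 25 ms window / 10 ms hop are illustrative assumptions, not settings taken from this project.

```python
# Minimal spectrogram sketch using SciPy (assumes a mono 16 kHz WAV file;
# the file name and framing parameters are illustrative).
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

sample_rate, audio = wavfile.read("hello_world.wav")  # hypothetical file

# 1. Segment the signal into short frames (25 ms windows, 10 ms hop here).
frame_len = int(0.025 * sample_rate)
hop_len = int(0.010 * sample_rate)

# 2. Apply the FFT to each frame; scipy handles framing + FFT internally.
freqs, times, spec = spectrogram(audio, fs=sample_rate,
                                 nperseg=frame_len,
                                 noverlap=frame_len - hop_len)

# 3. Each column of `spec` is a vector of per-frequency amplitudes for one frame.
log_spec = 10 * np.log10(spec + 1e-10)  # log scale, as normally plotted
print(log_spec.shape)  # (n_frequencies, n_time_frames)
```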

Next, we’ll look at Feature Extraction techniques which would reduce the noise and
dimensionality of our data.

Feature Extraction with MFCC

Mel Frequency Cepstrum Coefficient (MFCC) Analysis involves simplifying an audio signal
to highlight its vital speech elements by utilizing both Mel frequency and Cepstral
evaluations. This process clusters the spectrum of frequencies into distinguishable bands that are
perceptible to human hearing. Additionally, the signal is divided into its source and filter
components to eliminate variations between speakers that don't pertain to their manner of
articulation.

a) Mel Frequency Analysis

For speech recognition, the only relevant frequencies are those that humans can perceive. We can
segment the frequencies from the Spectrogram into groups that align with human hearing, and
exclude sounds that are imperceptible to us.

b) Cepstral Analysis

We must also distinguish the aspects of sound that aren't tied to a specific speaker. We can
conceptualize the creation of human speech as a fusion of a distinct source and a filter. The
source is individualized, while the filter pertains to the way we all articulate words during
communication.

This model is utilized in cepstral analysis to differentiate between these two elements. The
cepstrum, obtained using an algorithm, allows us to eliminate the part of speech that is distinct to
each person's vocal cords, while retaining the form of the sound produced by the vocal tract.
Cepstral analysis combined with Mel frequency analysis yields 12 or 13 MFCC features relevant to speech.

Thus, MFCC (Mel-Frequency Cepstral Coefficients) feature extraction:

● Reduces the dimensionality of our data, and
● Squeezes noise out of the system.

So there are two acoustic feature representations for Speech Recognition:

● Spectrograms
● Mel-Frequency Cepstral Coefficients (MFCCs)
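
As an illustration, the short sketch below extracts MFCC features with the librosa library; the file name, sampling rate, and number of coefficients are assumptions for demonstration rather than the configuration used in this project.

```python
# Minimal MFCC extraction sketch with librosa (file name is illustrative).
import librosa

# Load the audio at 16 kHz mono.
signal, sr = librosa.load("speech_sample.wav", sr=16000)

# 13 MFCCs per frame: Mel filterbank -> log -> DCT (the cepstral step).
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, n_frames): one 13-dimensional feature vector per frame
```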

Language Modeling
Language Models integrate linguistic knowledge into the process of converting speech to text
during speech recognition. This integration helps to resolve uncertainties related to spelling and
context, determining which combinations of words are the most plausible.

For instance, due to its reliance on sound, an Acoustic Model can't differentiate between words
that sound similar, like "HERE" and "HEAR." The outcomes generated by the Acoustic Model
represent a probability distribution spanning various words. Each potential word sequence's
likelihood, given the audio signal, can be computed.

When both the Acoustic Model and the Language Model are available, the most probable
sequence emerges as a combination of these possibilities, utilizing the highest likelihood scores.
This involves utilizing the Acoustic Model derived from the audio signal and the Statistical
Language Model informed by language information.

Our objective is to estimate the probability of a specific sentence occurring within a text corpus.
To do this, we've observed that the probability of a word sequence can be determined by
chaining the probabilities of its historical words. N-grams provide an approximation of sequence
probability using the principle of the chain rule.

To manage the computational challenge posed by extensive calculations, we adopt the Markov
Assumption, which allows us to approximate a sequence's probability using a shorter sequence.
We calculate probabilities by leveraging counts of bigrams and individual tokens, where the
function "c" denotes counting.
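
As a small worked example of this counting step, the sketch below estimates maximum-likelihood bigram probabilities from a toy corpus; the corpus and the unsmoothed formulation are illustrative assumptions.

```python
# Toy bigram language model: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1}).
from collections import Counter

corpus = "i can hear you here i can hear the music".split()  # toy corpus

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    """Maximum-likelihood bigram probability (no smoothing)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("can", "hear"))  # how likely "hear" is to follow "can"
```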

Ultimately, we combine these calculated probabilities with probabilities from the Acoustic
Model to mitigate language-related ambiguities in the range of sequence options.

To summarize the above Speech-to-Text (STT) process:

1. We extract features from the audio speech signal with MFCC.

2. We use an HMM acoustic model to produce sound units: phonemes and words.

3. We use statistical language models such as N-grams to resolve language ambiguities and create the final text sequence. Using a neural language model trained on massive amounts of text, probabilities of spelling and context can be scored.

Traditional and State-of-the-art ASR


The conventional approach to Automatic Speech Recognition (ASR) combines feature extraction and HMM acoustic models with language models. However, given the ability of Recurrent Neural Networks (RNNs) to track information over time, it becomes possible to substitute the acoustic model with a hybrid of RNNs and Connectionist Temporal Classification (CTC) layers.

CTC layers address the sequencing challenge that arises from having to map audio signals of varying lengths onto the corresponding text; integrating RNNs with CTC layers overcomes this hurdle. Moreover, if Deep Neural Networks (DNNs) are employed end to end, dedicated feature extraction and a separate language model may become unnecessary, streamlining the ASR pipeline.
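
A minimal Keras sketch of such a hybrid acoustic model, a GRU over MFCC frames with per-frame character outputs and a CTC loss, is shown below. The layer size, the 29-character alphabet, and the helper function name are assumptions for illustration, not the exact setup used later in this report.

```python
# Sketch of an RNN + CTC acoustic model in Keras (sizes/alphabet are assumed).
import tensorflow as tf
from tensorflow.keras import layers, Model

n_features = 13   # MFCCs per frame
n_classes = 29    # e.g. 26 letters + space + apostrophe + CTC blank (assumed)

# Acoustic model: variable-length MFCC sequences in, per-frame character
# probabilities out.
inputs = layers.Input(shape=(None, n_features), name="mfcc_frames")
x = layers.GRU(256, return_sequences=True)(inputs)
x = layers.BatchNormalization()(x)
outputs = layers.TimeDistributed(layers.Dense(n_classes, activation="softmax"))(x)
acoustic_model = Model(inputs, outputs)

def ctc_loss(y_true, y_pred, input_length, label_length):
    # CTC aligns the variable-length frame outputs with the target text;
    # the length tensors come from the data pipeline during training.
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
```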

Speech Recognition with Custom Models

Below is the gist of the architectural considerations involved in designing a deep learning model for speech recognition.

● RNN Units: RNNs are employed due to their capability to effectively model sequential
data, such as the temporal nature of speech signals. They maintain memory of previous
time steps, aiding in understanding context over time.
● GRU Units: Gated Recurrent Units (GRUs) are utilized to address the vanishing and exploding gradient issues that can occur when training simple RNNs. GRUs mitigate these problems by controlling gradient flow, ensuring more stable and efficient training.
● Batch Normalization: This technique is applied to reduce training times by normalizing
the activations within a batch of data. It helps in accelerating convergence during
training, contributing to faster and more stable learning.
● TimeDistributed Layer: The TimeDistributed layer is used to identify intricate patterns
within sequential data. It applies a specified layer (like a dense or convolutional layer) to
every time step independently, allowing the model to capture complex temporal
relationships.
● CNN Layer: The inclusion of a 1D convolutional layer introduces an additional layer of
complexity to the ASR model. This layer is particularly adept at detecting local patterns
and features within the sequential data.
● Bidirectional RNNs: Bidirectional RNNs process data in two directions: forward and
backward. This is advantageous for ASR as it enables the model to exploit both past and
future context when making predictions, leading to more accurate recognition outcomes.

In summary, these components and techniques enhance the capabilities of an ASR system by
effectively capturing temporal dependencies, addressing training challenges, speeding up
training, capturing complex patterns, adding complexity through convolutional layers, and
making use of bidirectional information for improved recognition accuracy.
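
As an illustration of how these components fit together, the Keras sketch below stacks a 1D convolution, batch normalization, a bidirectional GRU, and a TimeDistributed dense output; the layer sizes and filter counts are assumptions, and this is not one of the exact architectures trained for this report.

```python
# Illustrative combination of the components above (sizes are assumed).
from tensorflow.keras import layers, Model

n_features = 161  # e.g. spectrogram bins per frame (assumed)
n_classes = 29    # output characters including the CTC blank (assumed)

inputs = layers.Input(shape=(None, n_features))

# CNN layer: detects local patterns across neighbouring frames.
x = layers.Conv1D(filters=200, kernel_size=11, strides=2,
                  padding="same", activation="relu")(inputs)
x = layers.BatchNormalization()(x)  # faster, more stable training

# Bidirectional GRU: exploits both past and future temporal context.
x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)
x = layers.BatchNormalization()(x)

# TimeDistributed Dense: per-frame character probabilities.
outputs = layers.TimeDistributed(layers.Dense(n_classes, activation="softmax"))(x)

model = Model(inputs, outputs)
model.summary()
```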

Indian Accent Speech Recognition


India is a very diverse nation: different languages are spoken in different states, with different accents, so it becomes difficult to understand the effect of accent in traditional models. Because cepstral analysis effectively removes the accent component of the spoken language during modeling, traditional ASR proceeds largely independent of accent. To make the model recognize such accent variations, we can fine-tune a pre-trained speech model on a voice dataset containing spoken English recordings from many states. Here, we transfer-learn Baidu's DeepSpeech model and analyze the recognition improvement using the dataset.

The dataset, a 50+ GB Indic TTS voice database downloaded from the IITM Speech Lab, comprises 10,000+ spoken sentences from 20+ states (both male and female native speakers).

Our aim here is to figure out whether we can improve the accuracy of the model by letting it learn from the accent of the spoken sentences. The transfer-learned Baidu DeepSpeech model does not take the accent of the spoken sentences into account while training, so it acts as a benchmark for the custom models that we train on accent-rich sentences.

Design philosophy of the architectures
The models that we train follow a specific pattern of increasing complexity: we start by training the data on a simple RNN layer and keep increasing the complexity of the architecture, either by introducing another layer (an additional layer of an existing type, making the network deeper, or a layer of a different kind) or by increasing the complexity of the present layer.

In our case, we trained five models with increasing complexities and their architectures are given
below:

Model 0: RNN

Model 1: RNN + TimeDistributed Dense

Model 2: CNN + RNN + TimeDistributed Dense

Model 3: Deeper RNN + TimeDistributed Dense

Model 4: Bidirectional RNN + TimeDistributed Dense

Comparison between the models implemented:

Model 2 has the lowest training loss but a higher validation loss compared to all other models. This indicates that although the CNN layer helps to lower the training loss, it might be overfitting. Model 3 performs the best in validation loss because deeper RNN layers better model sequential data. The bidirectional RNN does not seem to help much, as the sequential inputs are not very long. In short, deeper RNNs perform the best in terms of validation loss.

Metrics:

WER (Word Error Rate), WACC (Word Accuracy), and BLEU (Bilingual Evaluation
Understudy) score are metrics used to evaluate the performance of various language processing
tasks, including Automatic Speech Recognition (ASR) and Machine Translation.

1. WER (Word Error Rate): WER is a common evaluation metric for ASR systems. It
measures the accuracy of the transcribed text by calculating the percentage of words that
are incorrectly recognized or substituted, deleted, or inserted compared to the reference
transcript. Lower WER values indicate better accuracy.
2. WACC (Word Accuracy): WACC is the complement of WER, representing the
accuracy of the transcribed text. It calculates the percentage of words that are correctly
recognized in the transcript. Higher WACC values indicate better accuracy.
3. BLEU (Bilingual Evaluation Understudy) Score: BLEU is a metric commonly used
for evaluating the quality of machine-generated translations. It measures the similarity
between the generated translation and one or more reference translations. BLEU
computes a precision-based score, considering n-grams (sequences of n words) in both
the generated and reference text. A higher BLEU score indicates better translation
quality, with a perfect score of 1.0 indicating exact match with the references.

These metrics are essential tools for assessing the performance of ASR and translation systems,
providing quantifiable measures to compare different models or approaches and guide their
improvement.
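
For reference, the sketch below shows one way to compute WER via a word-level edit distance, word accuracy as its complement, and a BLEU score using NLTK; the example sentences are made up for illustration.

```python
# WER via word-level edit distance; WAcc = 1 - WER. BLEU via NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution
            d[i][j] = min(d[i - 1][j] + 1,                # deletion
                          d[i][j - 1] + 1,                # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

ref = "the quick brown fox jumps"
hyp = "the quick brown box jumps"
print("WER :", wer(ref, hyp))        # 0.2 -> one substitution out of five words
print("WAcc:", 1 - wer(ref, hyp))    # 0.8
print("BLEU:", sentence_bleu([ref.split()], hyp.split(),
                             smoothing_function=SmoothingFunction().method1))
```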

Comparing results between our model and the original DeepSpeech model
After training our model on Indian-accented English, we compared the results between our model and the original DeepSpeech 0.5.1 model. To check the accuracy of the results, we used metrics such as WER, WAcc, and BLEU score. For testing, we took around 1,300 audio files of Indian-accented English along with their transcripts.

Metric               Trained Model   DeepSpeech Model
Word Error Rate      0.157           0.44
BLEU Score           0.76            0.65
Word Accuracy (%)    84.23           55.1

The metrics clearly show that our trained model performed better than the original DeepSpeech model on Indian-accented English.

Conclusion
While performing feature extraction (MFCC), we see that cepstral analysis separates out the accent component in traditional ASR, and we wanted to see whether a custom neural-network model can intrinsically learn accent features and perform better than traditional ASR. Hence, we transfer-learned a pre-trained model on Indian English speech data to set a baseline that we targeted to outperform. From the custom models implemented, we can also see that increasing the complexity of the model did not necessarily give better results; the deeper RNN gave the best results in terms of validation and training loss. From the metrics, it is clear that our models performed better than DeepSpeech's pre-trained, transfer-learned model, which suggests that this approach can be extended to other root languages or locale-specific accents as well.
