
Indian Institute of Technology,
Kharagpur

Industrial Training
(CH48013)

Summer Internship Report

Name: Katra Khushal Jagdish
Roll No: 19CH30039
Guided By: Dr. Neha Soni
Index

S No.  Content
1      Introduction
2      Problem Statement
3      What is Automatic Speech Recognition
4      Signal Analysis
5      Language Modeling
6      Traditional and State-of-the-art ASR
7      Indian Accent Speech Recognition
8      Design Philosophy of Architectures
9      Comparison between Custom Models Implemented
10     Comparison between Our Model and the Original DeepSpeech Model
11     Conclusion

Introduction
Languify AI is a B2B SaaS company that helps businesses track learning efficacy and in-class engagement through an in-house conversational AI, enabling companies to save substantially on operational time and costs. Languify strengthens edtech businesses by providing AI-powered solutions that broaden learning opportunities and by equipping edtech companies to uplift educators, academics, and businesses. A major part of Languify's operation is building chatbots that help educators hone their communication skills, and because the majority of its sales are in India, it is important to have a speech recognition system that works well for Indian English speakers, given the country's linguistic diversity. With over 700 languages spoken in India, there is bound to be great variation in the accents with which English is spoken. Hence, this project introduces a speech recognition model for Indian-accented English so that users of these chatbots feel comfortable and at ease.

Problem Statement
Develop an Automatic Speech Recognition (ASR) system tailored for Indian English speakers and determine the effect of Indian accents on the model.

What is ASR?
Automatic Speech Recognition (ASR) refers to the technology or process that allows machines
to convert spoken language into written text. This capability has a wide range of applications
including:

1. Voice assistants: Devices or applications like Amazon's Alexa, Google Assistant, Apple's
Siri, and Microsoft's Cortana make use of ASR to process and understand voice
commands from users.
2. Transcription services: ASR can be used to convert spoken content, such as lectures or
interviews, into written transcripts.
3. Voice-controlled applications and devices: This includes everything from
voice-controlled light switches to software applications that can be navigated using voice.
4. Accessibility tools: ASR systems are often employed to assist individuals with
disabilities, for instance, converting spoken words into written text for those who have
difficulty writing or typing.
5. Call centers: Some modern call centers use ASR to transcribe or analyze the content of
phone calls.
6. Language learning: ASR can be used in applications to help language learners with
pronunciation and fluency.

The development and accuracy of ASR systems depend on various factors including the
complexity of the language, the clarity of the spoken words, background noise, and the speaker's
accent. Machine learning, deep learning (especially recurrent neural networks and transformers),
and vast amounts of training data have been pivotal in recent advancements in ASR technology.

Challenges Faced in ASR:


1. Variability of Volume:
○ Challenge: The loudness with which people speak can vary widely, even within a
single sentence. A speaker might start a sentence loudly and end it in a whisper, or
vice versa.
○ Impact: ASR systems might miss quieter segments of speech or be overwhelmed
by louder ones, leading to inaccuracies in transcription or recognition.
2. Variability of Speaking Speed:
○ Challenge: Different speakers have different rates of speech. Some people
naturally speak faster, while others speak slowly. Additionally, a single speaker
might vary their speed based on their emotional state, the topic of conversation, or
other factors.
○ Impact: Faster spoken words can sometimes be "mushed" together, making it
hard for the ASR system to distinguish individual words. On the other hand,
slower speech can sometimes insert gaps that the ASR might mistakenly interpret
as word or sentence breaks.
3. Variability of Speaker:
○ Challenge: Everyone's voice is unique. The way one person pronounces a word
can be significantly different from how another person pronounces it. This
variability can be due to factors like age, gender, health, regional accents, and
native language.
○ Impact: ASR systems trained predominantly on one group of speakers might
struggle to recognize speech from a different group. For example, a system
trained mostly on young adult voices might have difficulty understanding elderly
speakers.
4. Variability of Pitch:
○ Challenge: Pitch can vary due to the speaker's mood, the question's intonation, or
natural fluctuations in a person's voice.
○ Impact: Rapid changes in pitch can confuse an ASR system, especially if the
pitch moves outside the range the system was trained on. This can result in
misrecognition of words or phrases.
5. Noise:
○ Challenge: Real-world environments are often noisy. Background sounds—like
traffic, other people talking, music, or machinery—can interfere with the clear
reception of the spoken words.
○ Impact: Noise can mask or distort parts of the speech signal, making it difficult
for the ASR system to correctly identify words. Even in cases where the system
can distinguish speech from noise, the transcription's accuracy might be reduced.

Models in Speech Recognition:


● Acoustic Model (AM): It maps the relationship between audio signals and the phonetic
units (like phonemes or senones) of a language. In essence, the acoustic model predicts
which phonetic units make up a given audio segment. It represents sounds and their
transitions. It often captures the nuances of speech, like pitch, tone, and speed.
● Language Model (LM): It estimates the likelihood or probability of a sequence of words
occurring in the language. The LM provides context to ensure that the sequence of words
produced by the ASR system is linguistically valid and probable based on historical
patterns. It represents patterns in word usage, grammar, and can also capture semantics to
some extent.

Signal Analysis:

When we talk, we produce wave-like patterns in the air. Sounds with higher pitches have quicker
and more frequent vibrations than those with lower pitches. A microphone converts these sound
vibrations into electrical signals. If we pronounce "Hello World", the resulting signal would have
two distinct patterns. Our speech consists of multiple frequencies occurring simultaneously,
essentially being a combination of all these frequencies. To study the signal, we identify its
individual frequencies as key features. The Fourier transform allows us to dissect the signal into
these specific frequencies. This method helps in transforming the sound into a Spectrogram,
where we map frequency on a vertical scale against its occurrence over time. The depth of the
color in the spectrogram denotes the strength of the signal.

To generate a Spectrogram:

1. Segment the signal based on specific time intervals.
2. Use FFT (Fast Fourier Transform) to break down each segment into its respective frequency elements.
3. Represent each segmented time frame using a vector, indicating the amplitude for each frequency.
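
To make these steps concrete, here is a minimal Python sketch that computes a spectrogram with SciPy. The file name and the 25 ms window / 10 ms hop are illustrative assumptions, not settings taken from this project.

```python
# Minimal spectrogram sketch using SciPy (assumes a mono 16 kHz WAV file;
# the file name and framing parameters are illustrative).
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

sample_rate, audio = wavfile.read("hello_world.wav")  # hypothetical file

# 1. Segment the signal into short frames (25 ms windows, 10 ms hop here).
frame_len = int(0.025 * sample_rate)
hop_len = int(0.010 * sample_rate)

# 2. Apply the FFT to each frame; scipy handles framing + FFT internally.
freqs, times, spec = spectrogram(audio, fs=sample_rate,
                                 nperseg=frame_len,
                                 noverlap=frame_len - hop_len)

# 3. Each column of `spec` is a vector of per-frequency amplitudes for one frame.
log_spec = 10 * np.log10(spec + 1e-10)  # log scale, as normally plotted
print(log_spec.shape)  # (n_frequencies, n_time_frames)
```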

Next, we’ll look at Feature Extraction techniques which would reduce the noise and
dimensionality of our data.

Feature Extraction with MFCC

Mel Frequency Cepstrum Coefficient (MFCC) Analysis involves simplifying an audio signal
to highlight its vital speech elements by utilizing both Mel frequency and Cepstral
evaluations. This process clusters the spectrum of frequencies into distinguishable bands that are
perceptible to human hearing. Additionally, the signal is divided into its source and filter
components to eliminate variations between speakers that don't pertain to their manner of
articulation.

a) Mel Frequency Analysis

For speech recognition, the only relevant frequencies are those that humans can perceive. We can
segment the frequencies from the Spectrogram into groups that align with human hearing, and
exclude sounds that are imperceptible to us.

b) Cepstral Analysis

We must also distinguish the aspects of sound that aren't tied to a specific speaker. We can
conceptualize the creation of human speech as a fusion of a distinct source and a filter. The
source is individualized, while the filter pertains to the way we all articulate words during
communication.

This model is utilized in cepstral analysis to differentiate between these two elements. The
cepstrum, obtained using an algorithm, allows us to eliminate the part of speech that is distinct to
each person's vocal cords, while retaining the form of the sound produced by the vocal tract.
Cepstral analysis combined with Mel frequency analysis yields 12 or 13 MFCC features relevant to speech.

Thus, MFCC (Mel-Frequency Cepstral Coefficients) feature extraction:

● Reduces the dimensionality of our data, and
● Squeezes noise out of the system.

So there are two acoustic feature representations for Speech Recognition:

● Spectrograms
● Mel-Frequency Cepstral Coefficients (MFCCs)
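
As an illustration, the short sketch below extracts MFCC features with the librosa library; the file name, sampling rate, and number of coefficients are assumptions for demonstration rather than the configuration used in this project.

```python
# Minimal MFCC extraction sketch with librosa (file name is illustrative).
import librosa

# Load the audio at 16 kHz mono.
signal, sr = librosa.load("speech_sample.wav", sr=16000)

# 13 MFCCs per frame: Mel filterbank -> log -> DCT (the cepstral step).
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, n_frames): one 13-dimensional feature vector per frame
```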

Language Modeling
Language Models integrate linguistic knowledge into the process of converting speech to text
during speech recognition. This integration helps to resolve uncertainties related to spelling and
context, determining which combinations of words are the most plausible.

For instance, due to its reliance on sound, an Acoustic Model can't differentiate between words
that sound similar, like "HERE" and "HEAR." The outcomes generated by the Acoustic Model
represent a probability distribution spanning various words. Each potential word sequence's
likelihood, given the audio signal, can be computed.

When both the Acoustic Model and the Language Model are available, the most probable
sequence emerges as a combination of these possibilities, utilizing the highest likelihood scores.
This involves utilizing the Acoustic Model derived from the audio signal and the Statistical
Language Model informed by language information.

Our objective is to estimate the probability of a specific sentence occurring within a text corpus.
To do this, we've observed that the probability of a word sequence can be determined by
chaining the probabilities of its historical words. N-grams provide an approximation of sequence
probability using the principle of the chain rule.

To manage the computational challenge posed by extensive calculations, we adopt the Markov
Assumption, which allows us to approximate a sequence's probability using a shorter sequence.
We calculate probabilities by leveraging counts of bigrams and individual tokens, where the
function "c" denotes counting.
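
As a small worked example of this counting step, the sketch below estimates maximum-likelihood bigram probabilities from a toy corpus; the corpus and the unsmoothed formulation are illustrative assumptions.

```python
# Toy bigram language model: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1}).
from collections import Counter

corpus = "i can hear you here i can hear the music".split()  # toy corpus

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    """Maximum-likelihood bigram probability (no smoothing)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("can", "hear"))  # how likely "hear" is to follow "can"
```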

Ultimately, we combine these calculated probabilities with probabilities from the Acoustic
Model to mitigate language-related ambiguities in the range of sequence options.

To summarize the above Speech-to-Text (STT) process:

1. We extract features from the audio speech signal with MFCC.

2. We use an HMM acoustic model to produce sound units: phonemes and words.

3. We use statistical language models such as N-grams to resolve language ambiguities and create the final text sequence. Using a neural language model trained on massive amounts of text, probabilities of spelling and context can be scored.

Traditional and State-of-the-art ASR


The conventional approach to Automatic Speech Recognition (ASR) combines feature extraction and HMM acoustic models with language models. However, given the ability of Recurrent Neural Networks (RNNs) to track information over time, it becomes possible to substitute the acoustic model with a hybrid of RNNs and Connectionist Temporal Classification (CTC) layers.

CTC layers address the sequencing challenge that arises from having to map audio signals of varying lengths onto the corresponding text; integrating RNNs with CTC layers overcomes this hurdle. Moreover, if Deep Neural Networks (DNNs) are employed end to end, dedicated feature extraction and a separate language model may become unnecessary, streamlining the ASR pipeline.
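
A minimal Keras sketch of such a hybrid acoustic model, a GRU over MFCC frames with per-frame character outputs and a CTC loss, is shown below. The layer size, the 29-character alphabet, and the helper function name are assumptions for illustration, not the exact setup used later in this report.

```python
# Sketch of an RNN + CTC acoustic model in Keras (sizes/alphabet are assumed).
import tensorflow as tf
from tensorflow.keras import layers, Model

n_features = 13   # MFCCs per frame
n_classes = 29    # e.g. 26 letters + space + apostrophe + CTC blank (assumed)

# Acoustic model: variable-length MFCC sequences in, per-frame character
# probabilities out.
inputs = layers.Input(shape=(None, n_features), name="mfcc_frames")
x = layers.GRU(256, return_sequences=True)(inputs)
x = layers.BatchNormalization()(x)
outputs = layers.TimeDistributed(layers.Dense(n_classes, activation="softmax"))(x)
acoustic_model = Model(inputs, outputs)

def ctc_loss(y_true, y_pred, input_length, label_length):
    # CTC aligns the variable-length frame outputs with the target text;
    # the length tensors come from the data pipeline during training.
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
```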

Speech Recognition with Custom Models

Below is the gist of the architectural considerations involved in designing a deep learning model for speech recognition.

● RNN Units: RNNs are employed due to their capability to effectively model sequential
data, such as the temporal nature of speech signals. They maintain memory of previous
time steps, aiding in understanding context over time.
● GRU Units: Gated Recurrent Units (GRUs) are utilized to address the vanishing and exploding gradient issues that can occur when training simple RNNs. GRUs mitigate these problems by controlling gradient flow, ensuring more stable and efficient training.
● Batch Normalization: This technique is applied to reduce training times by normalizing
the activations within a batch of data. It helps in accelerating convergence during
training, contributing to faster and more stable learning.
● TimeDistributed Layer: The TimeDistributed layer is used to identify intricate patterns
within sequential data. It applies a specified layer (like a dense or convolutional layer) to
every time step independently, allowing the model to capture complex temporal
relationships.
● CNN Layer: The inclusion of a 1D convolutional layer introduces an additional layer of
complexity to the ASR model. This layer is particularly adept at detecting local patterns
and features within the sequential data.
● Bidirectional RNNs: Bidirectional RNNs process data in two directions: forward and
backward. This is advantageous for ASR as it enables the model to exploit both past and
future context when making predictions, leading to more accurate recognition outcomes.

In summary, these components and techniques enhance the capabilities of an ASR system by
effectively capturing temporal dependencies, addressing training challenges, speeding up
training, capturing complex patterns, adding complexity through convolutional layers, and
making use of bidirectional information for improved recognition accuracy.
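
As an illustration of how these components fit together, the Keras sketch below stacks a 1D convolution, batch normalization, a bidirectional GRU, and a TimeDistributed dense output; the layer sizes and filter counts are assumptions, and this is not one of the exact architectures trained for this report.

```python
# Illustrative combination of the components above (sizes are assumed).
from tensorflow.keras import layers, Model

n_features = 161  # e.g. spectrogram bins per frame (assumed)
n_classes = 29    # output characters including the CTC blank (assumed)

inputs = layers.Input(shape=(None, n_features))

# CNN layer: detects local patterns across neighbouring frames.
x = layers.Conv1D(filters=200, kernel_size=11, strides=2,
                  padding="same", activation="relu")(inputs)
x = layers.BatchNormalization()(x)  # faster, more stable training

# Bidirectional GRU: exploits both past and future temporal context.
x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)
x = layers.BatchNormalization()(x)

# TimeDistributed Dense: per-frame character probabilities.
outputs = layers.TimeDistributed(layers.Dense(n_classes, activation="softmax"))(x)

model = Model(inputs, outputs)
model.summary()
```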

Indian Accent Speech Recognition


India is a very diverse nation: different languages are spoken in different states, with different accents, so it becomes difficult to understand the effect of accent in traditional models. Because cepstral analysis effectively removes the accent component of the spoken language during modeling, traditional ASR proceeds largely independent of accent. To make the model recognize such accent variations, we can fine-tune a pre-trained speech model on a voice dataset containing spoken English recordings from many states. Here, we transfer-learn Baidu's DeepSpeech model and analyze the recognition improvement using the dataset.

The dataset, a 50+ GB Indic TTS voice database downloaded from the IITM Speech Lab, comprises 10,000+ spoken sentences from 20+ states (both male and female native speakers).

Our aim here is to figure out whether we can improve the accuracy of the model by letting it learn from the accent of the spoken sentences. The transfer-learned Baidu DeepSpeech model does not take the accent of the spoken sentences into account while training, so it acts as a benchmark for the custom models that we train on accent-rich sentences.

Design philosophy of the architectures
The models that we train follow a specific pattern of increasing complexity: we start by training the data on a simple RNN layer and keep increasing the complexity of the architecture, either by introducing another layer (an additional layer of an existing type, making the network deeper, or a layer of a different kind) or by increasing the complexity of the present layer.

In our case, we trained five models with increasing complexities and their architectures are given
below:

Model 0: RNN

Model 1: RNN + TimeDistributed Dense

Model 2: CNN + RNN + TimeDistributed Dense

Model 3: Deeper RNN + TimeDistributed Dense

Model 4: Bidirectional RNN + TimeDistributed Dense

Comparison between the models implemented:

Model 2 has the lowest training loss but a higher validation loss compared to all other models. This indicates that although the CNN layer helps to lower the training loss, it might be overfitting. Model 3 performs the best in validation loss because deeper RNN layers better model sequential data. The bidirectional RNN does not seem to help much, as the sequential inputs are not very long. In short, deeper RNNs perform the best in terms of validation loss.

Metrics:

WER (Word Error Rate), WACC (Word Accuracy), and BLEU (Bilingual Evaluation
Understudy) score are metrics used to evaluate the performance of various language processing
tasks, including Automatic Speech Recognition (ASR) and Machine Translation.

1. WER (Word Error Rate): WER is a common evaluation metric for ASR systems. It
measures the accuracy of the transcribed text by calculating the percentage of words that
are incorrectly recognized or substituted, deleted, or inserted compared to the reference
transcript. Lower WER values indicate better accuracy.
2. WACC (Word Accuracy): WACC is the complement of WER, representing the
accuracy of the transcribed text. It calculates the percentage of words that are correctly
recognized in the transcript. Higher WACC values indicate better accuracy.
3. BLEU (Bilingual Evaluation Understudy) Score: BLEU is a metric commonly used
for evaluating the quality of machine-generated translations. It measures the similarity
between the generated translation and one or more reference translations. BLEU
computes a precision-based score, considering n-grams (sequences of n words) in both
the generated and reference text. A higher BLEU score indicates better translation
quality, with a perfect score of 1.0 indicating exact match with the references.

These metrics are essential tools for assessing the performance of ASR and translation systems,
providing quantifiable measures to compare different models or approaches and guide their
improvement.
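
For reference, the sketch below shows one way to compute WER via a word-level edit distance, word accuracy as its complement, and a BLEU score using NLTK; the example sentences are made up for illustration.

```python
# WER via word-level edit distance; WAcc = 1 - WER. BLEU via NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution
            d[i][j] = min(d[i - 1][j] + 1,                # deletion
                          d[i][j - 1] + 1,                # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

ref = "the quick brown fox jumps"
hyp = "the quick brown box jumps"
print("WER :", wer(ref, hyp))        # 0.2 -> one substitution out of five words
print("WAcc:", 1 - wer(ref, hyp))    # 0.8
print("BLEU:", sentence_bleu([ref.split()], hyp.split(),
                             smoothing_function=SmoothingFunction().method1))
```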

Comparing results between our model and the original DeepSpeech model
After training our model on Indian-accented English, we compared the results between our model and the original DeepSpeech 0.5.1 model. To check the accuracy of the results, we used metrics such as WER, WAcc, and BLEU score. For testing, we took around 1,300 audio files of Indian-accented English along with their transcripts.

Metric               Trained Model   DeepSpeech Model
Word Error Rate      0.157           0.44
BLEU Score           0.76            0.65
Word Accuracy (%)    84.23           55.1

The metrics clearly show that our trained model performed better than the original DeepSpeech model on Indian-accented English.

Conclusion
While performing feature extraction (MFCC), we see that cepstral analysis separates out the accent component in traditional ASR, and we wanted to see whether a custom neural-network model can intrinsically learn accent features and perform better than traditional ASR. Hence, we transfer-learned a pre-trained model on Indian English speech data to set a baseline that we targeted to outperform. From the custom models implemented, we can also see that increasing the complexity of the model did not necessarily give better results; the deeper RNN gave the best results in terms of validation and training loss. From the metrics, it is clear that our models performed better than DeepSpeech's pre-trained, transfer-learned model, which suggests that this approach can be extended to other root languages or locale-specific accents as well.
