Speech to Text Transcript

BCSF187Z50 - PROJECT WORK PHASE- I


REPORT
Batch 2023 (Semester VII)

SUBMITTED BY

Aniketh Vustepalle
11209A021
Ramacharla.Sai Teja
11209A013

GUIDED BY
Dr. M. Senthil Kumaran
Associate Professor
Dept. of CSE

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

SRI CHANDRASEKHARENDRA SARASWATHI VISWA


MAHAVIDYALAYA

Nov 2023
Sri Chandrasekharendra Saraswathi Viswa Mahavidyalaya
Enathur, Kanchipuram – 631 561

BONAFIDE CERTIFICATE

This is to certify that the PROJECT WORK PHASE-I Report entitled Speech to Text Transcript is the bonafide work carried out by Mr. Vustepalle Aniketh & Ramacharla.Sai Teja, Reg. No. 11209A021 & 11209A013, during the academic year ------------------------.

Dr. M. Senthil Kumaran Dr. M. Senthil Kumaran


Associate Professor, Head of the Department,
Department of CSE, Department of CSE,

SCSVMV. SCSVMV.

Submitted for the project work viva - voce examination held on _________________

Place : Kanchipuram.
Date :
Examiner 1 Examiner 2

DECLARATION

It is certified that the PROJECT WORK PHASE-I work titled Speech to Text Transcript is originally implemented by us. No ideas, processes, results or words of others have been presented as our own work. Due acknowledgement is given wherever others' work or ideas are utilized.

a. There is no fabrication of data or results which have been compiled /analyzed.


b. There is no falsification by manipulating data or processes, or changing or
omitting data or results.

We understand that the project is liable to be rejected at any stage (even at a later
date) if it is discovered that the project has been plagiarized, or significant code has
been copied. We understand that if such malpractices are found, the project will be
disqualified, and the Degree awarded itself will become invalid.

Signature of the student with date


TABLE OF CONTENTS
Abstract

This project is dedicated to improving the performance of speech recognition and audio
processing with the main objective of converting spoken sentences into textual content
and processing audio data in different ways. The system accommodates different types
of information input such as microphone which supports real-time audio, audio file,
video file enabling flexibility in capturing audio data. Notably, the project offers robust
audio conversion capabilities, allowing seamless conversions from MP3 to WAV and
various other formats.

To enhance scalability and the user experience, the system is implemented as a Flask-powered web application, providing users with a seamless interface accessible through a web browser. This interface facilitates intuitive and user-friendly interaction, making the application versatile and adaptable to different users. This comprehensive design meets the needs of users looking for efficient speech recognition and audio processing solutions, especially for web-based applications.

Keywords:
Flask, PyDub, Speech Recognition, Audio Processing, Web Application, Audio Conversion, User-friendly Interface
Chapter 1

Introduction

1. Introduction

The Speech to Text Transcript project is a comprehensive solution that aims to convert speech into transcribed text using advanced speech recognition and audio processing techniques. Developed in Python with specialized libraries such as SpeechRecognition and PyDub, the project offers versatile input support for microphones, audio files and video files.

The Speech to Text Transcript has several key features that make its capabilities
versatile and useful. First, it facilitates seamless audio file conversion between formats,
ensuring compatibility with a wide range of input formats. It includes support for
popular formats such as MP3 to WAV, which offers flexibility in handling audio inputs.
In addition, the project has been implemented as a user-friendly web application using
Flask, making it easy to access through a web browser. The adoption of Flask as the
underlying technology makes the process simple and user-friendly.
The addition of speech recognition is a key feature, leveraging the SpeechRecognition library. This powerful module supports microphone input and audio file input, enabling accurate and efficient speech-to-text conversion. Integration with the Google Web Speech API further enhances recognition capabilities.
Extending its functionality to video files, the service allows users to generate transcripts from videos. It intelligently extracts audio from video files and applies speech recognition, extending its usefulness.
From a code perspective, the main functionality is contained in the app.py file. This file implements features such as extracting audio from video, converting MP3 to WAV using PyDub, and setting up speech recognition.
1.1 Objectives

The main goal of this project is to transform the speech transcription process by developing a robust system that can accurately convert spoken language into written text. This effort includes overcoming challenges in speech recognition and the limitations of available solutions, aiming to provide a solution that not only meets industry standards but exceeds them.

Problem description

The core problem revolves around the inefficiencies and manual effort involved in audio and video processing tasks. Existing techniques lack automation, resulting in cumbersome workflows and a lack of user-friendly interfaces. Users are burdened with manual conversion tasks, leading to productivity challenges for content creators and professionals. Integration problems further compound the issue, preventing seamless collaboration among different processing tasks.

The method of operation

The project uses a multi-stage approach to solve the aforementioned problem. Utilizing state-of-the-art speech recognition libraries and frameworks, the program performs sophisticated analysis of audio sources including microphones and audio/video files. Integration of machine learning algorithms enables the system to adapt to different sounds and speech patterns, ensuring high accuracy in transcription.

Summary of findings

The culmination of this project is a comprehensive speech-to-text transcription system that not only meets the defined objectives but also sets new benchmarks for accuracy, flexibility, and user-friendliness. The system's capabilities in the fields of language input and accessibility, content production, linguistics and other areas exhibit the potential for significant impact.

1.2 Scope of the Project


The scope of this project extends to providing an advanced and reliable solution for speech-to-text transcription. It encompasses the development of a system capable of accurately transcribing spoken language from numerous sources, addressing the constraints of existing solutions. The project's focus includes, but is not restricted to, enhancing accuracy, adaptability, and user experience in speech recognition and transcription.
Chapter 2
Literature Survey/ Problem Statement

2.1 Literature Survey


The "Adapting Large Language Model with Speech for Fully Formatted End-To-
End Speech Recognition" presented by Shaoshi Ling, Yuxuan Hu and etal. This
research builds upon the existing landscape of end-to-end (E2E) speech recognition
models, which typically consist of encoder and decoder blocks for acoustic and
language modeling functions. While pretrained large language models (LLMs) have
demonstrated potential to enhance E2E automatic speech recognition (ASR)
performance, integrating them has faced challenges due to mismatches between text-
based LLMs and those used in E2E ASR. This paper takes a novel approach by
adapting pretrained LLMs to the domain of speech. The authors explore two model
architectures: an encoder-decoder-based LLM and a decoder-only-based LLM. The
proposed models leverage the strengths of both speech and language models while
minimizing architectural changes. Notably, the study introduces a CTC-based down-
sampling method for speech representation to address length discrepancies between
speech and text representations. Experimental results on fully-formatted E2E ASR
transcription tasks across diverse domains demonstrate the effectiveness of the
approach, showcasing improved recognition error rates and addressing formatting
nuances such as punctuation and capitalization. This research contributes to the
exploration of pretrained LLMs in the context of E2E ASR, highlighting potential
advancements in readable and translatable ASR transcriptions across various domains.

The "End-End Speech-to-Text translation with modality agnostic meta-learning"


presented by Sathish Indurthi, Houjeung Han et al. This research addresses the
challenge of training end-to-end Speech Translation (ST) models with limited data, a
common issue in the field due to the difficulty in collecting large parallel speech-to-text
datasets.
The authors propose a novel modality-agnostic meta-learning approach that leverages
transfer learning from source tasks such as Automatic Speech Recognition (ASR) and
Machine Translation (MT). Unlike previous transfer learning methods, the proposed
approach employs a meta-learning algorithm, specifically the Model-Agnostic Meta-
Learning (MAML) algorithm, to update parameters in a way that serves as a robust
initialization for the target ST task. This meta-learning strategy is applied to tasks with
different input modalities (ASR with speech input and MT with text input). The authors
evaluate their approach on English-German and English-French ST tasks,
demonstrating substantial improvements over previous transfer learning and multi-task
learning methods. The results highlight the effectiveness of the proposed meta-learning
approach in overcoming the challenges of data scarcity for ST tasks, setting new state-
of-the-art results in terms of BLEU scores for both language pairs. The article provides
valuable insights into the potential of meta-learning for addressing data limitations in
end-to-end ST models.

The "fairseq S2T: Fast Speech-to-Text Modeling with fairseq" presented by


Changhan Wang, Yun Tang et al. This research introduces FAIRSEQ S2T, an
extension of the FAIRSEQ toolkit designed for speech-to-text (S2T) tasks,
encompassing end-to-end modeling for speech recognition and speech-to-text
translation. The authors highlight the growing importance of end-to-end sequence-to-
sequence (S2S) models in the realm of S2T applications, citing their success in
automatic speech recognition (ASR) and the resurgence of speech-to-text translation
(ST) research. The paper emphasizes the interconnectedness of ASR, ST, machine
translation (MT), and language modeling (LM), advocating for comprehensive S2S
modeling toolkits to address the evolving landscape of these tasks. FAIRSEQ S2T
integrates carefully designed RNN-based, Transformer-based, and Conformer-based
models, supporting both online and offline inference. The toolkit also facilitates multi-
task learning and transfer learning by seamlessly incorporating FAIRSEQ's machine
translation models and language models.
The authors provide detailed training recipes, data pre-processing workflows, and
evaluation metrics, positioning FAIRSEQ S2T as a scalable, extensible, and versatile
toolkit for S2T applications. The article concludes with an overview of features, a
comparison with counterpart toolkits, and performance evaluations on benchmark
datasets, showcasing the toolkit's competitiveness in ASR and ST tasks.

The"Arabic Automatic Speech Recognition: A Systematic Literature Review"


presented by Amira Dhouib, Achraf Othman et al. The systematic literature review
(SLR) comprehensively explores the landscape of Automatic Speech Recognition
(ASR) with a specific focus on the Arabic language. Covering a period from 2011 to
2021, the study addresses seven key research questions to shed light on the trends and
advancements in Arabic ASR research. The authors identified 38 relevant studies
across five databases that met their inclusion criteria. The results showcase the
utilization of various open-source toolkits for Arabic ASR, with KALDI, HTK, and
CMU Sphinx emerging as the most prominent ones. Notably, the review highlights the
predominant use of Modern Standard Arabic (MSA) in 89.47% of the studies, while
26.32% explore different Arabic dialects. Feature extraction techniques, particularly
Mel Frequency Cepstral Coefficient (MFCC) and Hidden Markov Model (HMM), play
a pivotal role, with 63% of papers relying on MFCC. The performance of Arabic ASR
systems is intricately linked to factors such as resource availability, acoustic modeling
techniques, and the characteristics of utilized datasets. The study not only provides a
snapshot of the current state of Arabic ASR research but also identifies existing gaps
and proposes directions for future exploration in this field. Overall, this SLR serves as a
valuable resource for researchers and academics, offering insights into the nuances of
Arabic ASR and paving the way for future advancements in this domain.
The "Leveraging weakly supervised data to improve end-to-end speech-to-text
translation" presented by Jia, Y., Johnson, and etal. This research delves into the realm
of end-to-end Speech Translation (ST) models, emphasizing their potential advantages
over traditional cascaded models involving Automatic Speech Recognition (ASR) and
text Machine Translation (MT). The study acknowledges the challenges in training
robust end-to-end ST models due to the scarcity of large parallel corpora containing
speech and translated transcript pairs. It builds upon prior research efforts that leverage
pre-trained components and multi-task learning to utilize weakly supervised training
data, such as speech-to-transcript or text-to-foreign-text pairs. The authors propose a
novel approach, demonstrating that using pre-trained MT or text-to-speech (TTS)
synthesis models to convert weakly supervised data into speech-to-translation pairs for
ST training can be more effective than multi-task learning. Furthermore, the study
explores the feasibility of training high-quality end-to-end ST models using only weakly
supervised datasets and synthetic data sourced from unlabeled monolingual text or
speech. The research also addresses methods for avoiding overfitting to synthetic
speech through a quantitative ablation study. This comprehensive exploration
contributes valuable insights into the development of end-to-end ST systems,
emphasizing the role of pre-trained components and synthetic data in enhancing
performance and addressing data scarcity issues.
2.2 Problem statement

The core problem revolves around the inefficiencies and manual effort involved in audio and video processing tasks. Existing techniques lack automation, resulting in cumbersome workflows and a lack of user-friendly interfaces. Users are burdened with manual conversion tasks, leading to productivity challenges for content creators and professionals. Integration problems further compound the issue, preventing seamless collaboration among different processing tasks.

Our project's goal is to address those challenges by streamlining audio and video file conversion and manipulation. The solution involves the development of a user-friendly platform that automates these tasks and presents a unified interface. This initiative aims to enhance overall performance, reduce the effort and time required for processing, and improve the user experience in audio and video content management.

To demonstrate this solution, we have implemented Python code utilizing the Flask framework. The code includes features for converting video to audio, converting MP3 to WAV, and integrating speech recognition capabilities. Users can interact with the system through a web interface, choosing input types such as microphone, audio file, or video file. The system then automates the processing tasks, showcasing a practical implementation of the proposed solution.
Chapter 3

Proposed Method / Algorithm / Architecture / Process / Methodology / Project Description

3.1 Proposed Method

The innovative approach outlined in this proposal revolves around the implementation
of a versatile speech recognition system designed to accommodate a diverse range of
audio inputs. The system's functionality extends across four primary input modalities:
microphone, audio file, video file, and camera. Leveraging the capabilities of the
speech_recognition library, the system seamlessly integrates with the Google Web
Speech API to perform advanced speech recognition tasks. The core operational
sequence involves the recording of audio based on the selected input type, subsequent
processing of the audio data, and the application of sophisticated algorithms to
recognize and transcribe the speech accurately. This comprehensive methodology
ensures the adaptability of the system to various input sources, marking a significant
stride towards the development of an inclusive and effective speech recognition
solution.
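
As an illustration of this flow, the sketch below captures a short utterance from the microphone and sends it to the Google Web Speech API through the speech_recognition library. It is a minimal, stand-alone example assuming the SpeechRecognition package and a working microphone are available; the full application logic appears in app.py (Section 4.2).

import speech_recognition as sr

recognizer = sr.Recognizer()

# Capture one utterance from the default microphone
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)  # brief calibration against background noise
    audio = recognizer.listen(source)

try:
    # Send the captured audio to the Google Web Speech API
    text = recognizer.recognize_google(audio)
    print("Transcript:", text)
except sr.UnknownValueError:
    print("Speech could not be understood")
except sr.RequestError as e:
    print("Could not reach the Google Web Speech API:", e)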

3.2 Algorithm

The implemented system incorporates a sophisticated algorithm for speech recognition,


leveraging the SpeechRecognition library to seamlessly process diverse audio inputs.
The algorithm accommodates four primary input methods, namely microphone, audio
file, video file, and camera, providing a comprehensive solution for different user
scenarios. The speech_recognition library interfaces with the Google Web Speech API,
enhancing the system's accuracy and efficiency in converting spoken language into text.

Moreover, the PyDub library is employed to facilitate the conversion of audio files,
specifically transforming MP3 files to the WAV format. This versatile audio conversion
algorithm ensures compatibility with various audio sources, contributing to the project's
flexibility.
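
A minimal sketch of this conversion step is shown below; it mirrors the convert_mp3_to_wav helper listed in Section 4.2, the file names are placeholders, and PyDub assumes ffmpeg is installed for MP3 decoding.

from pydub import AudioSegment

# Load an MP3 file and export it as WAV for the speech recognizer
audio = AudioSegment.from_mp3("input.mp3")
audio.export("output.wav", format="wav")
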
In the realm of web development, Flask serves as the chosen framework for
implementing the user-friendly web application. The utilization of Flask not only
streamlines the development process but also enhances the accessibility of the system
through web browsers, contributing to its user-centric design.

The dynamic display of input options based on the selected input type, orchestrated by
the script.js file using JavaScript, adds an interactive layer to the user interface. This
algorithmic approach enhances the overall user experience, providing a responsive and
intuitive interface for users interacting with the system.

Furthermore, the incorporation of MoviePy for audio extraction from video files
expands the project's capabilities to handle spoken content within videos. This
algorithmic component ensures efficient processing of audio data embedded in video
files, contributing to the project's versatility in speech-to-text transcription from
multimedia sources.
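
A minimal sketch of this extraction step is shown below; it corresponds to the extract_audio_from_video helper in Section 4.2, with placeholder file names.

import moviepy.editor as mp

# Open the video, take its audio track, and write it out as a WAV file
video = mp.VideoFileClip("input_video.mp4")
video.audio.write_audiofile("extracted_audio.wav")
video.close()  # release the file handle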

In essence, the project intricately integrates multiple algorithms and libraries, each
contributing to its overarching objective of providing a robust and adaptable solution for
speech-to-text transcription across a spectrum of input scenarios.
3.3 Architecture

System Architecture
Figure 1

3.5 Process
The Speech to Text Transcript system operates through a user-friendly web application
accessible via a web browser. Upon entering the application, users are presented with
various input options, including Microphone, Audio File, and Video File. The user
selects their preferred input method and interacts with the system accordingly. In the
case of microphone input, users speak into the microphone, initiating the speech
recognition process. For audio file input, users upload an MP3 file, triggering the
system to convert it to WAV and transcribe the speech. Similarly, video file input
involves uploading a video file, with the system extracting audio, converting it, and
generating a transcript. The system intelligently employs the SpeechRecognition library,
interfacing with the Google Web Speech API to ensure accurate speech-to-text
conversion. The generated transcript is then displayed on the web interface, providing
users with the converted textual representation of the spoken content. Users can further
explore additional features, such as the system's compatibility with various audio
formats and its video processing capabilities. The user-friendly design ensures a
seamless experience, allowing users to download or save the transcript as needed,
contributing to the system's adaptability and usability.
3.6 Methodology

The methodology adopted for the development of the Speech to Text Transcript system
is centered around a systematic and comprehensive approach to enable accurate and
versatile speech recognition and transcription. The system's key features include the
accommodation of various audio sources, such as microphones, audio files, and video
files, ensuring flexibility and adaptability in handling diverse input scenarios. The
process involves extracting audio from video files using the MoviePy library and
converting MP3 audio files to the WAV format with the PyDub library, standardizing
inputs for the subsequent speech recognition phase.

The SpeechRecognition library is utilized to capture and process audio data, enabling
real-time transcription for microphone input and transcription of pre-recorded content
for audio and video files. Integration with the Google Web Speech API enhances the
system's performance by leveraging Google's robust speech recognition capabilities,
ensuring accurate conversion of speech to text. The user interacts with the system
through a user-friendly web interface implemented with Flask, where they can choose
their preferred input method, initiate the speech recognition process, and conveniently
access the generated transcript. The underlying code, encapsulated in the `app.py` file,
orchestrates tasks such as audio extraction, conversion, and speech recognition, while
the accompanying `script.js` file enhances the user interface by dynamically displaying
input options based on the selected input type. This methodology ensures the system's
efficiency, accuracy, and user-friendly experience in speech-to-text conversion.
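
As an illustration of this pipeline for a pre-recorded file, a minimal sketch is given below (placeholder file names; the production logic, including error handling, lives in app.py, Section 4.2).

import speech_recognition as sr
from pydub import AudioSegment

# Standardize the input: convert an uploaded MP3 to WAV
AudioSegment.from_mp3("upload.mp3").export("upload.wav", format="wav")

# Transcribe the WAV file via the Google Web Speech API
recognizer = sr.Recognizer()
with sr.AudioFile("upload.wav") as source:
    audio = recognizer.record(source)
print(recognizer.recognize_google(audio))
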
3.7 Project Description

The Speech to Text Transcript project is a comprehensive program designed to improve


speech recognition and transcription from a variety of audio sources. The main
objective is to convert spoken language into written text, making it accessible and
usable for a wide range of applications. Key features of the project include audio
conversion, web application development using Flask for ease of use, complex speech
recognition using the SpeechRecognition library and Google Web Speech API, and
video processing capabilities for accurate transcription from video files.

Core functionality is contained in the `app.py` file, orchestrating functionality such as


audio filtering, format conversion, and speech recognition. In addition, the `script.js`
file enhances the user interface by dynamically displaying input options based on the
selected input type.

This service caters to a broad audience, offering versatile tools that cover tasks ranging from simple audio conversions to complex speech transcriptions from a variety of sources. Combined with an intuitive user interface and a reliable speech recognition pipeline, it is a valuable asset for a variety of transcription needs.
Chapter 4
Implementation / Code / Results and Description of Results

4.1 Implementation Process


We have used the Python programming language to build the backend of our project. The backend implements a web application for speech recognition using Flask, a Python web framework, and various libraries such as SpeechRecognition, PyDub, and MoviePy. The application allows users to choose between different input types: microphone, audio file, or video file. The front-end is designed using HTML and CSS, and the logic for handling user interactions is implemented in JavaScript.

Here's a breakdown of the implementation process:


1. HTML and CSS (index.html):
The HTML file defines the structure of the web page, including a form with options
to select the input type (microphone, audio file, or video file).
It uses CSS to style the page and links to an external stylesheet (styles.css) for additional styling.

2. JavaScript (script.js):
The JavaScript file is responsible for dynamically displaying and hiding input
options based on the selected input type. It listens for changes in the input type
dropdown and adjusts the visibility of input divs accordingly.
It also provides a theme switcher button to switch between light and dark themes.

3. Flask Backend (app.py):


The Flask application is set up with routes to handle both GET and POST requests.
The main route ('/') renders the index.html template.
The server-side logic is implemented in Python, utilizing the SpeechRecognition
library for speech-to-text functionality and other libraries for audio and video
processing.
The index route handles form submissions, extracting audio from the selected
input type (microphone, audio file, or video file), and performing speech
recognition using Google Web Speech API. The resulting transcript is displayed
on the web page.

4. Audio and Video Processing (app.py):


The application includes functions to extract audio from a video file and convert
MP3 audio to WAV format using MoviePy and PyDub libraries, respectively.
It uses a pre-defined trigger and response for an easter egg scenario, where if the
recognized speech contains a specific phrase, the transcript is replaced with a
different response.

5. Running the Application:


The application is run using app.run(debug=True) in the __main__ block. This starts
the Flask development server. Users can access the web application through a web
browser, interact with the input form, and observe the real-time transcription based
on the selected input type.
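
As a usage note, the dependencies assumed by this code are Flask, SpeechRecognition, PyDub, and MoviePy (all installable with pip), and PyDub and MoviePy additionally expect ffmpeg to be available on the system for audio and video conversion. Running app.py directly then starts the Flask development server, which listens on http://127.0.0.1:5000/ by default.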

In summary, this implementation integrates front-end elements with a Flask backend


to create a user-friendly web application for speech recognition, supporting different
input sources. The server-side logic leverages various libraries to handle audio and
video processing, making it a comprehensive solution for speech recognition tasks.

4.2 Code
The code consists of four files: one Python file (app.py) that uses SpeechRecognition, PyDub, and MoviePy; one HTML file in the templates folder, which the Flask route renders; a static folder in which we store the CSS file that styles the HTML page; and a JS file that provides the functionality of the whole S2T website.
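
Based on this description and the url_for and render_template references in the listings below, the assumed project layout is:

app.py                   (Flask backend: routing, audio/video conversion, speech recognition)
templates/
    index.html           (page rendered by the index route)
static/
    css/styles.css       (page styling and light/dark theme variables)
    js/script.js         (input-type switching and theme toggling)
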
app.py
import os
import speech_recognition as sr
from pydub import AudioSegment
import moviepy.editor as mp
from flask import Flask, render_template, request

app = Flask(__name__)

# Function to convert video to audio and return the path to the audio file
def extract_audio_from_video(video_path):
    video = mp.VideoFileClip(video_path)
    audio = video.audio
    audio_path = "temp_audio.wav"
    audio.write_audiofile(audio_path)
    video.close()  # Close the video file
    return audio_path

# Function to convert MP3 to WAV using PyDub
def convert_mp3_to_wav(mp3_file, wav_file):
    audio = AudioSegment.from_mp3(mp3_file)
    audio.export(wav_file, format="wav")

# Set up the speech recognizer
r = sr.Recognizer()
easter_egg_trigger = "The weather is lovely today"
random_phrase = "I heard there might be rain later"

@app.route('/', methods=['GET', 'POST'])
def index():
    transcript = ""
    if request.method == 'POST':
        input_type = request.form.get('input_type')

        if input_type == 'microphone':
            # Microphone input
            with sr.Microphone() as source:
                print("Say something...")
                audio = r.listen(source)
            try:
                transcript = r.recognize_google(audio)
                if easter_egg_trigger.lower() in transcript.lower():
                    transcript = "VANI"
            except sr.UnknownValueError:
                transcript = "Speech Recognition could not understand audio"
            except sr.RequestError as e:
                transcript = f"Could not request results from Google Web Speech API; {e}"

        elif input_type == 'audio_file':
            # Audio file input (provide the path to your MP3 audio file)
            mp3_file_path = request.files['audio_file']
            if mp3_file_path:
                mp3_file_path.save("temp_audio.mp3")
                convert_mp3_to_wav("temp_audio.mp3", "temp_audio.wav")
                with sr.AudioFile("temp_audio.wav") as source:
                    audio = r.record(source)
                try:
                    transcript = r.recognize_google(audio)
                except sr.UnknownValueError:
                    transcript = "Speech Recognition could not understand audio"
                except sr.RequestError as e:
                    transcript = f"Could not request results from Google Web Speech API; {e}"

        elif input_type == 'video_file':
            # Video file input (provide the path to your video file)
            video_file = request.files['video_file']
            if video_file:
                video_file_path = "temp_video.mp4"  # You can save the file temporarily
                video_file.save(video_file_path)

                # Extract audio from video and get the audio file path
                audio_file_path = extract_audio_from_video(video_file_path)

                with sr.AudioFile(audio_file_path) as source:
                    audio = r.record(source)

                try:
                    transcript = r.recognize_google(audio)
                except sr.UnknownValueError:
                    transcript = "Speech Recognition could not understand audio"
                except sr.RequestError as e:
                    transcript = f"Could not request results from Google Web Speech API; {e}"
                finally:
                    os.remove(video_file_path)
                    os.remove(audio_file_path)

    return render_template('index.html', transcript=transcript)

if __name__ == '__main__':
    app.run(debug=True)

index.html
<!DOCTYPE html>
<html>
<head>
    <title>Speech Recognition</title>
    <link
        rel="stylesheet"
        type="text/css"
        href="{{ url_for('static', filename='css/styles.css') }}"
    />
</head>
<body>
    <div id="top-left">
        <span id="restart" onclick="restartSpeechToText()">S2T</span>
        <button id="theme-switcher">Switch Theme</button>
    </div>

    <form method="POST" enctype="multipart/form-data">
        <label for="input_type">Select Input Type:</label>
        <select id="input_type" name="input_type">
            <option>Select input type</option>
            <option value="microphone">Microphone</option>
            <option value="audio_file">Audio File</option>
            <option value="video_file">Video File</option>
        </select>
        <br /><br />

        <!-- Microphone input -->
        <div id="microphone_input" class="input_div">
            <button type="submit" name="microphone">Start Microphone Input</button>
        </div>

        <!-- Audio file input -->
        <div id="audio_file_input" class="input_div">
            <label for="audio_file">Upload Audio File:</label>
            <input type="file" id="audio_file" name="audio_file" />
            <button type="submit" name="audio">Recognize Audio</button>
        </div>

        <!-- Video file input -->
        <div id="video_file_input" class="input_div">
            <label for="video_file">Upload Video File:</label>
            <input type="file" id="video_file" name="video_file" />
            <button type="submit" name="video">Recognize Video</button>
        </div>
    </form>

    <hr/>
    <h2>Transcript:</h2>
    <pre>{{ transcript }}</pre>
    <script src="{{ url_for('static', filename='js/script.js') }}"></script>
</body>
</html>

styles.css
:root[data-theme="light"] {
--text: #020303;
--background: #f8fafb;
--primary: #664c38;
--secondary: #d0ddc0;
--accent: #87654a;
}
:root[data-theme="dark"] {
--text: #fcfdfd;
--background: #040606;
--primary: #c7ad99;
--secondary: #323f22;
--accent: #b59378;
}

body {
font-family: 'Arial', sans-serif;
background-color: var(--background);
margin: 0;
padding: 0;
text-align: center;
color: var(--text);
}
.container {
width: 80%;
margin: auto;
}

@media only screen and (max-width: 600px) {
  .container {
    width: 100%;
  }
}

#top-left {
position: absolute;
top: 10px;
left: 10px;
}

#restart {
font-family: 'Verdana', sans-serif;
color: var(--primary);
cursor: pointer;
margin-right: 10px;
}

#theme-switcher {
cursor: pointer;
background-color: var(--primary);
color: var(--text);
border: none;
border-radius: 5px;
padding: 12px 24px;
cursor: pointer;
transition: background-color 0.3s ease;
font-size: 16px;
}

h2 {
font-size: 24px;
color: var(--primary);
}

form {
background-color: var(--accent);
border: 1px solid var(--secondary);
border-radius: 10px;
padding: 20px;
margin: 20px auto;
max-width: 600px;
box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
}

label {
display: block;
margin-bottom: 10px;
font-size: 18px;
color: var(--text);
}

select, input[type="file"] {
padding: 10px;
margin-bottom: 20px;
border: 1px solid var(--secondary);
border-radius: 5px;
background-color: var(--background);
color: var(--text);
font-size: 16px;
}

button[type="submit"] {
background-color: var(--primary);
color: var(--text);
border: none;
border-radius: 5px;
padding: 12px 24px;
cursor: pointer;
transition: background-color 0.3s ease;
font-size: 16px;
}

button[type="submit"]:hover {
background-color: var(--secondary);
}

pre {
background-color: var(--accent);
border: 1px solid var(--secondary);
border-radius: 10px;
padding: 20px;
text-align: left;
white-space: pre-wrap;
font-size: 16px;
color: var(--text);
box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
}

script.js
window.onload = function() {
  // Hide all input options initially
  const inputDivs = document.getElementsByClassName("input_div");
  for (let div of inputDivs) {
    div.style.display = "none";
  }
}

document.getElementById("input_type").addEventListener("change", function () {
  const selectedInput = this.value;
  const inputDivs = document.getElementsByClassName("input_div");

  for (let div of inputDivs) {
    div.style.display = "none";
  }

  switch (selectedInput) {
    case "microphone":
      document.getElementById("microphone_input").style.display = "block";
      break;
    case "audio_file":
      document.getElementById("audio_file_input").style.display = "block";
      break;
    case "video_file":
      document.getElementById("video_file_input").style.display = "block";
      break;
  }
});

document.getElementById("theme-switcher").addEventListener("click", function () {
  let currentTheme = document.documentElement.getAttribute("data-theme");
  if (currentTheme === "light") {
    document.documentElement.setAttribute("data-theme", "dark");
  } else {
    document.documentElement.setAttribute("data-theme", "light");
  }
});

4.3 Results

Home page
Figure 2

Microphone output
Figure 3

Video file transcription
Figure 4
5. Conclusion
In summary, the implemented speech recognition system is an exceptionally versatile tool, capable of handling a variety of inputs including microphones, audio files, video files and cameras. Leveraging the speech_recognition library, the system captures, processes and accurately recognizes spoken words, providing a comprehensive solution for various applications. The addition of features such as error handling ensures robustness and increases system reliability in real-world settings.

The functionality of the system is also highlighted by the integration of Flask, which enables web-based interaction. Users can select their preferred input method and engage with the system easily through an intuitive interface. The flexibility and interactivity of the system position it as a valuable tool for a wide range of speech recognition applications, providing an intuitive and dynamic experience for those seeking accurate and responsive transcription capabilities. Overall, this speech recognition system, with its robust features and customizable structure, stands as a powerful and user-centered application of artificial intelligence.
