7sem_projectreport
SUBMITTED BY
Aniketh Vustepalle
11209A021
Ramacharla.Sai Teja
11209A013
GUIDED BY
Dr. M. Senthil Kumaran
Associate Professor
Dept. of CSE
Nov 2023
Sri Chandrasekharendra Saraswathi Viswa Mahavidyalaya
Enathur, Kanchipuram – 631 561
BONAFIDE CERTIFICATE
This is to certify that the PROJECT WORK PHASE-I Report entitled Speech to Text
Transcript is the bonafide work carried out by Mr. Vustepalle Aniketh &
Ramacharla.Sai Teja Reg.No 11209A021 & 11209A013 during the academic year
------------------------.
SCSVMV.
Submitted for the project work viva - voce examination held on _________________
Place : Kanchipuram.
Date :
Examiner 1 Examiner 2
DECLARATION
It is certified that the PROJECT WORK PHASE-I titled Speech to Text Transcript
is originally implemented by us. No ideas, processes, results, or words of others
have been presented as our own work. Due acknowledgement is given wherever others'
work or ideas are utilized.
We understand that the project is liable to be rejected at any stage (even at a later
date) if it is discovered that the project has been plagiarized, or significant code has
been copied. We understand that if such malpractices are found, the project will be
disqualified, and the Degree awarded itself will become invalid.
ABSTRACT
This project is dedicated to improving the performance of speech recognition and audio
processing, with the main objective of converting spoken sentences into text and
processing audio data in different ways. The system accommodates different input
types, such as a microphone (supporting real-time audio), audio files, and video
files, enabling flexibility in capturing audio data. Notably, the project offers robust
audio conversion capabilities, allowing seamless conversion from MP3 to WAV and
various other formats.
To enhance scalability and user experience, the system is implemented as a Flask-
powered web application, providing a seamless interface accessible through any web
browser. This interface facilitates intuitive, user-friendly interaction, making the
application versatile and adaptable to different users. This comprehensive design
meets the needs of users looking for efficient speech recognition and audio
processing solutions, especially for web-based applications.
Keywords:
Flask, PyDub, Speech Recognition, Audio Processing, Web Application, Audio
Conversion, User-friendly Interface
Chapter 1
Introduction
1. Introduction
The Speech to Text Transcript project is a comprehensive solution that aims to convert
speech into transcribed text using advanced speech recognition and audio processing
techniques. Developed in Python with specialized libraries such as SpeechRecognition
and PyDub, the project offers versatile input support for microphones, audio files,
and video files.
The Speech to Text Transcript has several key features that make its capabilities
versatile and useful. First, it facilitates seamless audio file conversion between formats,
ensuring compatibility with a wide range of input formats. It includes support for
popular formats such as MP3 to WAV, which offers flexibility in handling audio inputs.
In addition, the project has been implemented as a user-friendly web application using
Flask, making it easy to access through a web browser. The adoption of Flask as the
underlying technology makes the process simple and user-friendly.
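As a minimal, self-contained sketch of this Flask pattern (simplified from the actual app.py shown in Chapter 4, returning plain strings instead of rendering a template):

```python
from flask import Flask, request

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def index():
    # On POST, read which input type the user selected in the form;
    # the real application would then run the matching transcription pipeline.
    if request.method == 'POST':
        input_type = request.form.get('input_type', 'microphone')
        return f"Selected input: {input_type}"
    # On GET, just serve the page (the real app renders index.html here).
    return "Speech to Text Transcript"

if __name__ == '__main__':
    app.run(debug=True)
```

Running `python app.py` starts Flask's development server, after which the page is reachable from any browser at http://127.0.0.1:5000/.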
The addition of speech recognition is a key feature, leveraging the SpeechRecognition
library. This powerful module supports both microphone input and audio file input,
enabling accurate and efficient speech-to-text conversion. Integration with the Google
Web Speech API further enhances recognition accuracy.
Extending its functionality to video files, the service allows users to transcribe
spoken content from videos. It intelligently extracts the audio track from video files
and applies speech recognition, extending its usefulness to multimedia sources.
From a code perspective, the main functionality is contained in the app.py file. This
file implements features such as extracting audio from video, converting MP3 to WAV
using PyDub, and performing speech recognition.
1.1 Objectives
The main goal of this project is to transform the transcription process by developing
a robust system that can accurately convert spoken language into written text. This
effort includes overcoming the speech recognition and transcription shortcomings of
available solutions, aiming to provide a solution that not only meets industry
standards but exceeds them.
Problem description
The core problem revolves around the inefficiencies and manual effort involved in
audio and video processing tasks. Existing techniques lack automation, resulting in
cumbersome workflows and a lack of user-friendly interfaces. Users are burdened with
manual conversion tasks, creating productivity challenges for content creators and
professionals. Integration problems further compound the issue, preventing seamless
interoperation among the different processing steps.
The project uses a multi-stage approach to solve this problem. Utilizing
state-of-the-art speech recognition libraries and frameworks, the program performs
sophisticated analysis of audio from sources including microphones and audio/video
files. The integration of machine learning algorithms enables the system to adapt to
different speakers and speaking styles, ensuring high transcription accuracy.
Summary of findings
The culmination of this project is a comprehensive speech-to-text transcription system
that not only meets the defined objectives but also sets new benchmarks for accuracy,
flexibility, and user-friendliness. The system's capabilities in accessibility,
content production, linguistics, and other areas demonstrate its potential for broad
impact.
Our project's goal is to address these challenges by streamlining audio and video file
conversion and manipulation. The solution involves the development of a user-friendly
platform that automates these tasks and presents a unified interface. This initiative
aims to enhance overall performance, reduce the effort and time required for
processing, and improve the user experience in audio and video content management.
To demonstrate this solution, we have implemented a Python application using the Flask
framework. The code includes features for converting video to audio, converting MP3 to
WAV, and integrating speech recognition capabilities. Users interact with the system
through a web interface, choosing input types such as microphone, audio file, or video
file. The system then automates the processing tasks, showcasing a practical
implementation of the proposed solution.
Chapter 3
The innovative approach outlined in this proposal revolves around the implementation
of a versatile speech recognition system designed to accommodate a diverse range of
audio inputs. The system's functionality extends across four primary input modalities:
microphone, audio file, video file, and camera. Leveraging the capabilities of the
speech_recognition library, the system seamlessly integrates with the Google Web
Speech API to perform advanced speech recognition tasks. The core operational
sequence involves the recording of audio based on the selected input type, subsequent
processing of the audio data, and the application of sophisticated algorithms to
recognize and transcribe the speech accurately. This comprehensive methodology
ensures the adaptability of the system to various input sources, marking a significant
stride towards the development of an inclusive and effective speech recognition
solution.
3.2 Algorithm
Moreover, the PyDub library is employed to facilitate the conversion of audio files,
specifically transforming MP3 files to the WAV format. This versatile audio conversion
algorithm ensures compatibility with various audio sources, contributing to the project's
flexibility.
In the realm of web development, Flask serves as the chosen framework for
implementing the user-friendly web application. The utilization of Flask not only
streamlines the development process but also enhances the accessibility of the system
through web browsers, contributing to its user-centric design.
The dynamic display of input options based on the selected input type, orchestrated by
the script.js file using JavaScript, adds an interactive layer to the user interface. This
algorithmic approach enhances the overall user experience, providing a responsive and
intuitive interface for users interacting with the system.
Furthermore, the incorporation of MoviePy for audio extraction from video files
expands the project's capabilities to handle spoken content within videos. This
algorithmic component ensures efficient processing of audio data embedded in video
files, contributing to the project's versatility in speech-to-text transcription from
multimedia sources.
In essence, the project intricately integrates multiple algorithms and libraries, each
contributing to its overarching objective of providing a robust and adaptable solution for
speech-to-text transcription across a spectrum of input scenarios.
3.3 Architecture
System Architecture
Figure 1
3.5 Process
The Speech to Text Transcript system operates through a user-friendly web application
accessible via a web browser. Upon entering the application, users are presented with
various input options, including Microphone, Audio File, and Video File. The user
selects their preferred input method and interacts with the system accordingly. In the
case of microphone input, users speak into the microphone, initiating the speech
recognition process. For audio file input, users upload an MP3 file, triggering the
system to convert it to WAV and transcribe the speech. Similarly, video file input
involves uploading a video file, with the system extracting audio, converting it, and
generating a transcript. The system intelligently employs the SpeechRecognition library,
interfacing with the Google Web Speech API to ensure accurate speech-to-text
conversion. The generated transcript is then displayed on the web interface, providing
users with the converted textual representation of the spoken content. Users can further
explore additional features, such as the system's compatibility with various audio
formats and its video processing capabilities. The user-friendly design ensures a
seamless experience, allowing users to download or save the transcript as needed,
contributing to the system's adaptability and usability.
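The per-input-type branching described above can be summarized as a small dispatch table; the step names here are illustrative labels for the stages, not the actual function names in app.py:

```python
def plan_pipeline(input_type: str) -> list[str]:
    """Return the ordered processing steps for a given input type.

    Illustrative sketch of the branching described above; the step
    names are placeholders, not functions from the actual app.py.
    """
    pipelines = {
        "microphone": ["record_from_microphone", "recognize_speech"],
        "audio_file": ["save_upload", "convert_mp3_to_wav", "recognize_speech"],
        "video_file": ["save_upload", "extract_audio_from_video",
                       "recognize_speech"],
    }
    if input_type not in pipelines:
        raise ValueError(f"Unsupported input type: {input_type}")
    return pipelines[input_type]
```

Note that all three pipelines converge on the same recognition stage, which is why the system standardizes every input to WAV first.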
3.6 Methodology
The methodology adopted for the development of the Speech to Text Transcript system
is centered around a systematic and comprehensive approach to enable accurate and
versatile speech recognition and transcription. The system's key features include the
accommodation of various audio sources, such as microphones, audio files, and video
files, ensuring flexibility and adaptability in handling diverse input scenarios. The
process involves extracting audio from video files using the MoviePy library and
converting MP3 audio files to the WAV format with the PyDub library, standardizing
inputs for the subsequent speech recognition phase.
The SpeechRecognition library is utilized to capture and process audio data, enabling
real-time transcription for microphone input and transcription of pre-recorded content
for audio and video files. Integration with the Google Web Speech API enhances the
system's performance by leveraging Google's robust speech recognition capabilities,
ensuring accurate conversion of speech to text. The user interacts with the system
through a user-friendly web interface implemented with Flask, where they can choose
their preferred input method, initiate the speech recognition process, and conveniently
access the generated transcript. The underlying code, encapsulated in the `app.py` file,
orchestrates tasks such as audio extraction, conversion, and speech recognition, while
the accompanying `script.js` file enhances the user interface by dynamically displaying
input options based on the selected input type. This methodology ensures the system's
efficiency, accuracy, and user-friendly experience in speech-to-text conversion.
3.7 Project Description
This service caters to a broad audience, offering versatile tools that cover tasks
ranging from simple audio conversions to complex speech transcriptions from a variety
of sources. With its intuitive user interface and reliable speech recognition, it is a
valuable asset for a wide variety of transcription needs.
Chapter 4
Implementation / Code / Results and Description of Results
2. JavaScript (script.js):
The JavaScript file is responsible for dynamically displaying and hiding input
options based on the selected input type. It listens for changes in the input type
dropdown and adjusts the visibility of input divs accordingly.
It also provides a theme switcher button to switch between light and dark themes.
4.2 Code
The code consists of four files: one Python file (app.py) that uses
SpeechRecognition, PyDub, and MoviePy; one HTML file in the templates folder, which
the Flask routes render; a CSS file in the static folder that provides the design for
the HTML; and finally a JS file that implements the client-side functionality of the
whole S2T website.
app.py
import os
import speech_recognition as sr
from pydub import AudioSegment
import moviepy.editor as mp
from flask import Flask, render_template, request

app = Flask(__name__)

easter_egg_trigger = "vani"  # placeholder; the original trigger phrase is not shown

# Function to convert video to audio and return the path to the audio file
def extract_audio_from_video(video_path):
    video = mp.VideoFileClip(video_path)
    audio = video.audio
    audio_path = "temp_audio.wav"
    audio.write_audiofile(audio_path)
    video.close()  # Close the video file
    return audio_path

# Function to convert an MP3 file to WAV format using PyDub
def convert_mp3_to_wav(mp3_path, wav_path):
    AudioSegment.from_mp3(mp3_path).export(wav_path, format="wav")

@app.route('/', methods=['GET', 'POST'])
def index():
    transcript = ""
    if request.method == 'POST':
        r = sr.Recognizer()
        input_type = request.form.get('input_type')

        if input_type == 'microphone':
            # Microphone input
            with sr.Microphone() as source:
                print("Say something...")
                audio = r.listen(source)
            try:
                transcript = r.recognize_google(audio)
                if easter_egg_trigger.lower() in transcript.lower():
                    transcript = "VANI"
            except sr.UnknownValueError:
                transcript = "Speech Recognition could not understand audio"
            except sr.RequestError as e:
                transcript = f"Could not request results from Google Web Speech API; {e}"

        elif input_type == 'audio_file':
            # Audio file input (uploaded MP3 file, converted to WAV)
            mp3_file = request.files['audio_file']
            if mp3_file:
                mp3_file.save("temp_audio.mp3")
                convert_mp3_to_wav("temp_audio.mp3", "temp_audio.wav")
                with sr.AudioFile("temp_audio.wav") as source:
                    audio = r.record(source)
                try:
                    transcript = r.recognize_google(audio)
                except sr.UnknownValueError:
                    transcript = "Speech Recognition could not understand audio"
                except sr.RequestError as e:
                    transcript = f"Could not request results from Google Web Speech API; {e}"

        elif input_type == 'video_file':
            # Video file input: save the upload, extract its audio, then transcribe
            video_file = request.files['video_file']
            video_file_path = "temp_video.mp4"
            video_file.save(video_file_path)
            # Extract audio from video and get the audio file path
            audio_file_path = extract_audio_from_video(video_file_path)
            try:
                with sr.AudioFile(audio_file_path) as source:
                    audio = r.record(source)
                transcript = r.recognize_google(audio)
            except sr.UnknownValueError:
                transcript = "Speech Recognition could not understand audio"
            except sr.RequestError as e:
                transcript = f"Could not request results from Google Web Speech API; {e}"
            finally:
                os.remove(video_file_path)
                os.remove(audio_file_path)

    return render_template('index.html', transcript=transcript)

if __name__ == '__main__':
    app.run(debug=True)
index.html
<!DOCTYPE html>
<html>
<head>
<title>Speech Recognition</title>
<link
rel="stylesheet"
type="text/css"
href="{{ url_for('static', filename='css/styles.css') }}"
/>
</head>
<body>
  <div id="top-left">
    <span id="restart" onclick="restartSpeechToText()">S2T</span>
    <button id="theme-switcher">Switch Theme</button>
  </div>
  <hr/>
  <form method="POST" enctype="multipart/form-data">
    <label for="input_type">Input type:</label>
    <select id="input_type" name="input_type">
      <option value="microphone">Microphone</option>
      <option value="audio_file">Audio File</option>
      <option value="video_file">Video File</option>
    </select>
    <div id="microphone_input" class="input_div">
      <label>Press Transcribe, then speak into your microphone.</label>
    </div>
    <div id="audio_file_input" class="input_div">
      <input type="file" name="audio_file" accept=".mp3"/>
    </div>
    <div id="video_file_input" class="input_div">
      <input type="file" name="video_file" accept="video/*"/>
    </div>
    <button type="submit">Transcribe</button>
  </form>
  <h2>Transcript:</h2>
  <pre>{{ transcript }}</pre>
  <script src="{{ url_for('static', filename='js/script.js') }}"></script>
</body>
</html>
styles.css
:root[data-theme="light"] {
--text: #020303;
--background: #f8fafb;
--primary: #664c38;
--secondary: #d0ddc0;
--accent: #87654a;
}
:root[data-theme="dark"] {
--text: #fcfdfd;
--background: #040606;
--primary: #c7ad99;
--secondary: #323f22;
--accent: #b59378;
}
body {
font-family: 'Arial', sans-serif;
background-color: var(--background);
margin: 0;
padding: 0;
text-align: center;
color: var(--text);
}
.container {
width: 80%;
margin: auto;
}
#top-left {
position: absolute;
top: 10px;
left: 10px;
}
#restart {
font-family: 'Verdana', sans-serif;
color: var(--primary);
cursor: pointer;
margin-right: 10px;
}
#theme-switcher {
cursor: pointer;
background-color: var(--primary);
color: var(--text);
border: none;
border-radius: 5px;
padding: 12px 24px;
transition: background-color 0.3s ease;
font-size: 16px;
}
h2 {
font-size: 24px;
color: var(--primary);
}
form {
background-color: var(--accent);
border: 1px solid var(--secondary);
border-radius: 10px;
padding: 20px;
margin: 20px auto;
max-width: 600px;
box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
}
label {
display: block;
margin-bottom: 10px;
font-size: 18px;
color: var(--text);
}
select, input[type="file"] {
padding: 10px;
margin-bottom: 20px;
border: 1px solid var(--secondary);
border-radius: 5px;
background-color: var(--background);
color: var(--text);
font-size: 16px;
}
button[type="submit"] {
background-color: var(--primary);
color: var(--text);
border: none;
border-radius: 5px;
padding: 12px 24px;
cursor: pointer;
transition: background-color 0.3s ease;
font-size: 16px;
}
button[type="submit"]:hover {
background-color: var(--secondary);
}
pre {
background-color: var(--accent);
border: 1px solid var(--secondary);
border-radius: 10px;
padding: 20px;
text-align: left;
white-space: pre-wrap;
font-size: 16px;
color: var(--text);
box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
}
script.js
window.onload = function() {
// Hide all input options initially
const inputDivs = document.getElementsByClassName("input_div");
for (let div of inputDivs) {
div.style.display = "none";
}
}
document.getElementById("input_type").addEventListener("change", function () {
  const selectedInput = this.value;
  const inputDivs = document.getElementsByClassName("input_div");
  // Hide every input div first so only the newly selected one is shown
  for (let div of inputDivs) {
    div.style.display = "none";
  }
  switch (selectedInput) {
    case "microphone":
      document.getElementById("microphone_input").style.display = "block";
      break;
    case "audio_file":
      document.getElementById("audio_file_input").style.display = "block";
      break;
    case "video_file":
      document.getElementById("video_file_input").style.display = "block";
      break;
  }
});
document.getElementById("theme-switcher").addEventListener("click", function () {
  let currentTheme = document.documentElement.getAttribute("data-theme");
  if (currentTheme === "light") {
    document.documentElement.setAttribute("data-theme", "dark");
  } else {
    document.documentElement.setAttribute("data-theme", "light");
  }
});

// Reload the page to restart a speech-to-text session (used by the S2T label)
function restartSpeechToText() {
  window.location.reload();
}
4.3 Results
Home page
Figure 2
(Microphone output)
Figure 3
The functionality of the system is also highlighted by the integration of Flask, which
enables web-based interaction. Users can select their preferred input method and
engage with the system easily through an intuitive interface. The flexibility and
interactivity of the system position it as a valuable tool for a wide range of speech
recognition applications, providing an intuitive and dynamic experience for those
seeking accurate and responsive transcription capabilities. Overall, this speech
recognition system, with its robust features and customizable structure, stands as a
powerful and user-centered application of artificial intelligence.